Abstract


The inefficiency of using an unbiased estimator in a Monte Carlo procedure can be quantified
using an inefficiency constant, equal to the product of the variance of the estimator
and its mean computational cost. We develop methods for obtaining the parameters of the 
importance sampling (IS) change of measure via single- and multi-stage minimization of 
well-known estimators of cross-entropy and the mean square of the IS estimator, as well as of 
new estimators of such a mean square and inefficiency constant. We prove the convergence 
and asymptotic properties of the minimization results in our methods. We show that if a 
zero-variance IS parameter exists, then, under appropriate assumptions, minimization results 
of the new estimators converge to such a parameter at a faster rate than such results of the
well-known estimators, and a positive definite asymptotic covariance matrix of the minimization
results of the cross-entropy estimators is four times such a matrix for the well-known mean 
square estimators. We introduce criteria for comparing the asymptotic efficiency of stochastic 
optimization methods, applicable to the minimization methods of estimators considered in 
this work. In our numerical experiments for computing expectations of functionals of an Euler 
scheme, the minimization of the new estimators led to the lowest inefficiency constants and 
variances of the IS estimators, followed by the minimization of the well-known mean square 
estimators, and the cross-entropy ones. 
Introduction


In this work we consider the problem of estimating an expectation of the form $\mathbb{E}_{\mathbb{Q}_1}(Z)$, where
$\mathbb{Q}_1$ is a probability and $Z$ is a $\mathbb{Q}_1$-integrable random variable. Such expectations are of interest
in a variety of fields. For instance, they arise as prices of derivatives in mathematical finance 
[19], as committors in molecular dynamics [26, 42, 3], and as probabilities of buffer overflow 
in telecommunications, system failure in dependability modelling, or ruin in insurance risk 
modelling [5]. The Monte Carlo (MC) method relies on approximating such an expectation 
using an average of independent replicates of $Z$ under $\mathbb{Q}_1$. The inefficiency of the MC method
can be quantified using an inefficiency constant, also known as a work-normalized variance 
[23, 20, 45, 6, 7]. We discuss such constants and their interpretations in more detail in Chapter 
2. Efficiency improvement techniques (EITs) (the term having been proposed in [23]) try to 
improve the efficiency of the estimation of the expectation of interest over the crude MC as 
above, e.g. by using some MC method with a lower inefficiency constant. Popular statistical 
EITs include control variates, importance sampling (IS), antithetic variables, and stratified 
sampling; see e.g. [5, 20]. The control variates method relies on generating in an MC method
replicates of a control variates estimator, equal to the sum of $Z$ and a $\mathbb{Q}_1$-zero-mean random
variable, called a control variate [5, 22]. In importance sampling (IS), for a probability $\mathbb{Q}_2$, called
an IS distribution, and a random variable $L$ such that $\mathbb{E}_{\mathbb{Q}_2}(ZL) = \mathbb{E}_{\mathbb{Q}_1}(Z)$, called an IS density,
one computes in an MC method replicates of the IS estimator $ZL$ under $\mathbb{Q}_2$. IS has found
numerous applications among others to the computation of the expectations mentioned 
above and is a useful tool for rare-event simulation [21, 5, 56, 10, 30, 37]. Adaptive EITs use 
the information from the random drawings available to make the estimation method more 
efficient, e.g. by tuning some parameter of the method from some set $A \subset \mathbb{R}^l$. For instance,
in adaptive control variates one typically tunes the parameter in some parametrization of
the control variates, while in IS one tunes it in some parametrizations $b \mapsto \mathbb{Q}(b)$ of the IS distributions
and $b \mapsto L(b)$ of the IS densities. Adaptive IS and control variates can have a two-stage form,
in the first stage of which an adaptive parameter as above is obtained and in the second a 
separate IS or control variates MC procedure is performed using this parameter. Typically in 
the literature adaptive control variates and IS have attempted to find a parameter optimizing 
(i.e. minimizing or maximizing) some function $f : A \to \mathbb{R}$. Frequently, such a function was
the variance or equivalently the mean square of the adaptive estimator and it was minimized; 
see e.g. [46, 30, 4, 37, 35] for adaptive IS and [22, 40, 32] for control variates. We say that two 
functions $f_i$, $i = 1, 2$, are positively (negatively) linearly equivalent if $f_1 = a f_2 + b$ for some
linear proportionality constant $a \in (0, \infty)$ ($a \in (-\infty, 0)$) and $b \in \mathbb{R}$. In a number of adaptive IS
approaches it was proposed to maximize a certain function negatively linearly equivalent to 
the cross-entropy distance (also known as Kullback-Leibler divergence) of the zero-variance IS
distribution (if it exists) from the IS distribution considered [47, 48, 43]. 
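The IS identity described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that $\mathbb{Q}_1$ is the standard normal law, $Z = \mathbb{1}(X > 4)$, and $\mathbb{Q}_2$ is the shifted law $N(b, 1)$, so that $L(x) = \exp(-bx + b^2/2)$; the function names are hypothetical and not taken from this work.

```python
import math
import random


def crude_mc(n, threshold, rng):
    """Average of n replicates of Z = 1(X > threshold) with X ~ N(0, 1)."""
    return sum(rng.gauss(0.0, 1.0) > threshold for _ in range(n)) / n


def importance_sampling(n, threshold, b, rng):
    """Average of n replicates of Z * L under the shifted law N(b, 1).

    L(x) = exp(-b * x + b**2 / 2) is the density of N(0, 1) with respect
    to N(b, 1), so Z * L under N(b, 1) has the same mean as Z under N(0, 1).
    """
    total = 0.0
    for _ in range(n):
        x = rng.gauss(b, 1.0)
        if x > threshold:
            total += math.exp(-b * x + 0.5 * b * b)
    return total / n


def exact_tail(threshold):
    """Exact P(X > threshold) for X ~ N(0, 1)."""
    return 0.5 * math.erfc(threshold / math.sqrt(2.0))


rng = random.Random(0)
threshold = 4.0
is_estimate = importance_sampling(20_000, threshold, b=threshold, rng=rng)
mc_estimate = crude_mc(20_000, threshold, rng)
print(exact_tail(threshold), is_estimate, mc_estimate)
```

With the same number of replicates, the crude MC average is usually zero or a coarse multiple of 1/n for such a rare event, while the IS average is typically within a few percent of the exact value.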

We define cross-entropy to be a certain function of the IS parameter, positively linearly equivalent
to the cross-entropy distance of the zero-variance IS distribution from the IS distribution
considered, even though this name is sometimes used in the literature as a synonym of the
cross-entropy distance [47, 14]. In addition to minimizing the mean square and such a cross-entropy,
in this work we also minimize the inefficiency constant. To our knowledge, this is the first
time that an inefficiency constant has been minimized for adaptive MC. One reason why many
previous works focused on the minimization of variance rather than the inefficiency constant
may be that for some problems considered in these works the mean computation cost was
approximately constant as a function of the adaptive parameter and thus the inefficiency
constant and variance were approximately proportional. For instance, this is typically the case 
in parametric adaptive control variates and in parametric IS for many problems of derivative 
pricing in computational finance [19, 30, 37]. However, in numerous current and potential applications
of IS in which the computation of a replicate of the IS estimator involves simulating
a stochastic process until a random time, the mean cost typically depends on the IS parameter
and the minimization of the variance and that of the inefficiency constant are no longer equivalent.
This is for instance typically the case when performing IS for pricing knock-out barrier options 
in computational finance [19, 30]. Further examples are provided by the molecular dynamics 
applications in which one is interested in computing expectations of various functionals of 
discretizations of diffusions considered until their exit time of some set; see e.g. [56,16] and 
our numerical experiments. See also [21] and references therein for some examples from 
queueing theory and dependability modelling. 

Two types of stochastic optimization methods have typically been used in the literature for
optimizing some functions $f$ as above. Methods of the first type are stochastic approximation
algorithms. These are multi-stage stochastic optimization methods using stochastic gradient
descent, in which estimates of the values of gradients of such $f$ are computed in each
stage. See e.g. [32] for an application of such methods to variance minimization in adaptive 
control variates and [4, 37, 35] in adaptive IS. One problem with such methods is that their 
practical performance heavily depends on the choice of step sizes, and some heuristic tuning 
of them may be needed to achieve a reasonable performance [32]. Stochastic optimization 
methods of the second type rely, in their simplest form, on the optimization of $b \mapsto \hat{f}(b, \omega)$
for an appropriate random function $\hat{f} : A \times \Omega \to \mathbb{R}$ (where $(\Omega, \mathcal{F}, \mathbb{P})$ is the default probability
space and $\omega \in \Omega$ is an elementary event). The function $\hat{f}$ can be thought of as an estimator or a
stochastic counterpart of $f$, and thus the methods from this class have been called stochastic
counterpart methods, alternative names including sample path and sample average approximation
methods [28, 32, 34, 53]. See Chapter 6, Section 9, in [53] for a historical review of
such methods, related to M-estimation and in particular maximum likelihood estimation in 
statistics [55]. The most well-known example of an application of the stochastic counterpart 
method to efficiency improvement is linearly parametrized control variates [5, 22, 40], in
which to obtain the control variates parameter one minimizes the sample variance of the
control variates estimator by solving a certain system of linear equations. See [46, 47, 48, 30]
for applications of the stochastic counterpart method to adaptive IS and [32] for an application 
to nonlinearly parametrized control variates. In some works on adaptive IS it was proposed to 
perform a multi-stage stochastic counterpart method (as opposed to the single-stage one as 
above), in which the optimization result from a given stage is used to construct the estimator 
optimized in the subsequent stage [46, 48]. As discussed heuristically in Section 2 in [46], such 
an approach may be better than the single-stage one because the asymptotic distribution of 
the optimization results of the estimators from its final stage may be less spread than when 
using some default estimators in the single-stage case. 
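The stochastic counterpart idea for linearly parametrized control variates can be sketched as follows. The example assumes a single control variate, for which the system of linear equations reduces to one equation; the integrand $Z = e^U$ with $U$ uniform on $(0, 1)$ and the function names are hypothetical choices made for this illustration.

```python
import math
import random


def sample_variance(values):
    """Unbiased sample variance."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


def fit_control_variate(z, y):
    """Coefficient b minimizing the sample variance of z_i + b * y_i.

    For a single control variate the linear system reduces to one
    equation: b = -cov(z, y) / var(y).
    """
    n = len(z)
    z_mean = sum(z) / n
    y_mean = sum(y) / n
    cov = sum((zi - z_mean) * (yi - y_mean) for zi, yi in zip(z, y))
    var = sum((yi - y_mean) ** 2 for yi in y)
    return -cov / var


rng = random.Random(1)
u = [rng.random() for _ in range(5_000)]
z = [math.exp(ui) for ui in u]  # E(Z) = e - 1
y = [ui - 0.5 for ui in u]      # zero-mean control variate
b = fit_control_variate(z, y)
cv = [zi + b * yi for zi, yi in zip(z, y)]
print(b, sample_variance(z), sample_variance(cv))
```

Since b = 0 is among the candidates, the minimized sample variance can never exceed that of the plain replicates.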

In this work we investigate single- and multi-stage stochastic counterpart methods minimizing 
some well-known estimators of mean square [46, 30] and cross-entropy [47, 48], as well as 
newly proposed estimators of mean square and inefficiency constant. In our theoretical 
analysis we focus on the parametrizations of IS obtained via exponential change of measure 
(ECM) and via linearly parametrized exponential tilting for Gaussian stopped sequences 
(LETGS). Using IS in some special cases of the ECM and LETGS settings has been demonstrated 
to lead to significant variance reductions e.g. in rare event simulation [10, 5] and when pricing 
options in computational finance [30, 37]. We provide sufficient and in some cases also 
necessary assumptions under which there exist unique minimum points of the cross-entropy 
and mean square as well as of their estimators in the ECM and LETGS settings and we give 
some sufficient conditions for these assumptions to hold in the Euler scheme case. It is well 
known that for some important parametrizations of IS the minimum points of the cross-entropy
estimators can be found exactly, which makes these estimators more convenient to
minimize than the well-known mean square estimators, for the minimization of which one 
typically uses some iterative methods. This is for instance the case in some special cases of the 
ECM setting, in IS for finite support distributions (see examples 3.5 and 3.6 in [48]), and when 
using the Girsanov transformation with a linear parametrization of IS drifts for diffusions [56]. 
We show that this is also the case in the LETGS setting. 
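As a hedged illustration of an exactly solvable cross-entropy estimator, consider the Gaussian mean-shift family $N(b, 1)$, a simple special case chosen for this sketch and not the LETGS setting itself. With nonnegative weights $w_i = Z_i L_i$, the minimizer is the weighted mean of the samples. The function name is hypothetical.

```python
import math
import random


def cross_entropy_shift(xs, weights):
    """Exact minimizer of the cross-entropy estimator for the family N(b, 1).

    Minimizing -sum_i w_i * log(density of N(b, 1) at x_i) over b gives the
    weighted mean of the samples, with w_i = Z_i * L_i >= 0.
    """
    total = sum(weights)
    if total <= 0.0:
        raise ValueError("all weights vanish; the estimator has no minimizer")
    return sum(w * x for w, x in zip(weights, xs)) / total


rng = random.Random(2)
threshold = 2.0
# Samples are drawn under the original law N(0, 1), so L_i = 1.
xs = [rng.gauss(0.0, 1.0) for _ in range(200_000)]
weights = [1.0 if x > threshold else 0.0 for x in xs]  # Z_i * L_i
b_ce = cross_entropy_shift(xs, weights)

# For comparison: E(X | X > threshold) for X ~ N(0, 1).
tail = 0.5 * math.erfc(threshold / math.sqrt(2.0))
limit = math.exp(-0.5 * threshold**2) / math.sqrt(2.0 * math.pi) / tail
print(b_ce, limit)
```

No iterative solver is needed here, which is the convenience referred to above.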

An important contribution of this work is the definition of versions of single- and multi-stage 
minimization methods of the above estimators in the ECM and LETGS settings whose results 
enjoy appropriate strong convergence and asymptotic properties in the limit of the increasing 
budget of the single-stage minimization or the increasing number of stages of the multi-stage
minimization. To ensure such properties of the multi-stage methods we use increasing
numbers of simulations in the consecutive stages and projections of the minimization results 
onto some bounded sets. Furthermore, in the proofs we apply a new multi-stage strong law of 
large numbers. For the cross-entropy estimators we consider their exact minimization utilising 
formulas for their minimum points, and we prove the a.s. convergence of their minimization 
results to the unique minimum point of cross-entropy. We show that the well-known mean 
square estimators in both settings and the new mean square estimators in the ECM setting are 
convex and we prove the a.s. convergence of the results of their minimization with gradient- 
based stopping criteria to the unique minimum point of mean square. For the new mean 
square estimators in the LETGS setting and the ones of the inefficiency constant in the ECM
setting for a constant computation cost, we prove the a.s. convergence of their minimization 
results to the unique minimum point of the mean square when using the following two-phase 
minimization procedure. In its first phase some convex estimator of the mean square as above 
can be minimized, and then, using its minimization result as a starting point, one can carry out
a constrained minimization of the considered estimator or an unconstrained minimization
of an appropriately modified version of this estimator. For the inefficiency constant estimators
in the LETGS setting we propose a more complicated three-phase minimization procedure
with gradient-based stopping criteria, the first phase of which can be as above. We prove the
convergence of the minimization results in such a procedure to the set of the first-order critical
points of the inefficiency constant at which its values are not higher than at the minimum
point of the variance, or are even lower by at least some positive constant if the gradient of the
inefficiency constant at the minimum point of the variance does not vanish.
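A minimal sketch of minimizing a convex mean square estimator with a gradient-based stopping criterion is given below. It assumes the Gaussian mean-shift case with samples drawn under the original law $N(0, 1)$, so that the estimator reads $\frac{1}{n} \sum_i Z_i^2 \exp(-b X_i + b^2/2)$; plain Newton iterations and all names are choices made for this illustration, not the procedure analysed in this work.

```python
import math
import random


def mean_square_estimator(b, xs, z2):
    """(1/n) * sum_i Z_i**2 * L(b)(X_i) with L(b)(x) = exp(-b*x + b*b/2)."""
    return sum(w * math.exp(-b * x + 0.5 * b * b) for x, w in zip(xs, z2)) / len(xs)


def minimize_mean_square(xs, z2, b0=0.0, tol=1e-8, max_iter=100):
    """Newton iterations, stopped once |gradient| of the estimator is below tol."""
    b = b0
    for _ in range(max_iter):
        grad = hess = 0.0
        for x, w in zip(xs, z2):
            term = w * math.exp(-b * x + 0.5 * b * b)
            grad += term * (b - x)                # first derivative in b
            hess += term * (1.0 + (b - x) ** 2)   # second derivative, nonnegative
        if abs(grad) / len(xs) < tol:
            break
        b -= grad / hess
    return b


rng = random.Random(5)
threshold = 2.0
xs = [rng.gauss(0.0, 1.0) for _ in range(50_000)]
z2 = [1.0 if x > threshold else 0.0 for x in xs]  # Z_i**2 for Z = 1(X > threshold)
b_ms = minimize_mean_square(xs, z2)
print(b_ms, mean_square_estimator(0.0, xs, z2), mean_square_estimator(b_ms, xs, z2))
```

The second derivative is a sum of nonnegative terms, which reflects the convexity of the estimator in this special case.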

Using the theory of the asymptotic behaviour of minimization results of random functions 
from [51], we develop such a theory for the minimization results of such functions when 
using gradient-based stopping criteria. We use it for proving the asymptotic properties of the 
single- and multi-stage minimization methods of the estimators as above. To our knowledge, 
previously in the literature only the strong convergence and asymptotic properties of the 
single-stage minimization of the well-known mean square estimators were proved in [30], 
but only in the limit of the increasing number of simulations, in the ECM setting for normal 
random vectors, under stronger integrability assumptions than in our work, and using exact 
minimization which cannot be implemented in practice as opposed to the minimization with 
gradient-based stopping criteria considered in this work. 

Another important contribution of this work is the definition of the first- and second-order 
criteria for comparing the asymptotic efficiency of certain stochastic optimization methods 
for the minimization of a given function. A method more efficient in the first-order sense 
leads to lower values of the minimized function in the minimization results by at least a 
fixed positive constant with probability going to one as the budget of the method increases. 
The second-order asymptotic efficiency of the minimization methods in which such values 
converge in probability to the same constant can be quantified using some parameters, like the 
means, of some second-order asymptotic distributions of such values around such a constant. 
We apply such criteria to comparing the asymptotic efficiency of the single- and multi-stage 
minimization methods of the estimators discussed above. For these methods, the means of
the distributions as above can be potentially estimated and adaptively minimized. 

We show that if $\mathbb{Q}_1(Z \neq 0) > 0$ then there exists a unique IS distribution leading to the lowest
variance of the IS estimator, which we call the optimal-variance one. If additionally $Z \geq 0$,
$\mathbb{Q}_1$ a.s., then the optimal-variance IS distribution leads to a zero-variance IS estimator. IS
parameters leading to such distributions are called optimal-variance or zero-variance ones 
respectively. We show that if there exists an optimal-variance IS parameter for the new mean 
square estimators or a zero-variance one for the inefficiency constant estimators, then under 
appropriate assumptions a.s. the minimization results of the exact single- and multi-stage 
minimization of such estimators are equal to such respective parameters for a sufficiently large
simulation budget used. Furthermore, for the single- or multi-stage minimization of these 
estimators with gradient-based stopping criteria we can have a faster rate of convergence of 
the minimization results to such parameters than for the well-known estimators. We also show 
that if there exists a zero-variance IS parameter, then, under appropriate assumptions, the 
asymptotic covariance matrix of the minimization results of the cross-entropy estimators is 
positive definite and is four times such a matrix for the well-known mean square estimators. 
We provide an analytical example in which all possible relations between the asymptotic 
variances (i.e. equalities and both strict inequalities) of the minimization results of different 
types of estimators converging to the same point are achieved for different parameters of the 
example, except that using the cross-entropy estimators always leads to not lower asymptotic 
variance than using the well-known mean square estimators. 

In our numerical experiments we consider an Euler scheme discretization of a diffusion in a 
potential. We address the problem of estimating the moment-generating function (MGF) of 
the exit time of such an Euler scheme of a domain, the probability to exit it by a fixed time, 
and the probabilities to leave it through given parts of the boundary, called committors. Such 
quantities are of interest e.g. in molecular dynamics applications; see [17, 27, 56, 26, 42, 3]. 
We use IS in the LETGS setting, for which under the IS distribution we receive again an Euler 
scheme but this time with an additional drift depending on the IS parameter, called an IS drift. 
For the estimation of the above quantities we use a two-stage method as discussed above, in
the first stage of which to obtain the IS parameter we use simple multi-stage minimization of 
various estimators. In our numerical experiments, the minimization of the new estimators of 
inefficiency constant and mean square led to the lowest variances and inefficiency constants 
of the IS estimators, followed by the minimization of the well-known mean square estimators, 
and of the cross-entropy ones. In one case, the minimization of the inefficiency constant 
estimators outperformed the minimization of the new mean square estimators by arriving at 
a lower mean cost and a higher variance but so that their product, equal to the inefficiency 
constant, was lower. The variances and inefficiency constants of the adaptive IS estimators in 
our experiments strongly depended on the parametrization of the IS drifts used and could be 
reduced by adding appropriate positive constants to the variables Z as above. For a committor 
we also performed experiments comparing the spread of the IS drifts obtained from single- 
stage minimization, which yielded results qualitatively and quantitatively close to the case 
when a zero-variance IS parameter exists as discussed above. We provide some intuitions 
supporting the observed results. 
2 Monte Carlo method and inefficiency constant


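Let us further in this work denote $\mathbb{N}_p = \{p, p + 1, \ldots\}$, $\mathbb{N} = \mathbb{N}_0$, $\mathbb{N}_+ = \mathbb{N}_1$, and $\mathbb{R}_+ = (0, \infty)$. For a
set $A \in \mathcal{B}(\mathbb{R}^l)$ for some $l \in \mathbb{N}_+$, or $A \in \mathcal{B}(\overline{\mathbb{R}})$ (where $\mathcal{B}(B)$ is the Borel $\sigma$-field on $B$), the default
measurable space which we shall consider on it is $(A, \mathcal{B}(A))$, further denoted simply as $\mathcal{S}(A)$.
Consider a probability $\mathbb{Q}_1$ on a measurable space $\mathcal{S}_1 = (\Omega_1, \mathcal{F}_1)$ and let $Z$ be an $\mathbb{R}$-valued
random variable on $\mathcal{S}_1$ (i.e. a measurable function from $\mathcal{S}_1$ to $\mathcal{S}(\mathbb{R})$) such that $\mathbb{E}_{\mathbb{Q}_1}(|Z|) < \infty$.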

We are interested in the estimation of $\alpha := \mathbb{E}_{\mathbb{Q}_1}(Z)$. The above defined quantities shall be
frequently used further in this work. In the Monte Carlo (MC) method, for some $n \in \mathbb{N}_+$,
one approximates $\alpha$ using an MC average $\hat{\alpha}_n := \frac{1}{n} \sum_{i=1}^{n} Z_i$ of independent random variables
$Z_i$, $i = 1, \ldots, n$, each having the same distribution as $Z$ under $\mathbb{Q}_1$, shortly called independent
replicates of $Z$ under $\mathbb{Q}_1$. The variance of $\hat{\alpha}_n$ measures its mean squared error of approximation of
$\alpha$, and for $\mathrm{var} := \mathrm{Var}_{\mathbb{Q}_1}(Z)$ we have $\mathrm{Var}(\hat{\alpha}_n) = \frac{\mathrm{var}}{n}$.

When performing an MC procedure on a computer it is often the case that there exists a
nonnegative random variable $\dot{C}$ on $\mathcal{S}_1$ such that for generated independent replicates $(Z_i, \dot{C}_i)$,
$i = 1, \ldots, n$, of $(Z, \dot{C})$ under $\mathbb{Q}_1$, $\dot{C}_i$ are typically approximately equal to some practical costs,
like computation times, needed to generate $Z_i$. We call such $\dot{C}$ a practical cost variable (of
an MC step). Often we have $\dot{C} = p_{\dot{C}} C$ for some $p_{\dot{C}} \in \mathbb{R}_+$, which may be different for different
computers and implementations (shortly, for different practical realizations) considered, and a
random variable $C$ on $\mathcal{S}_1$, called a theoretical cost (of an MC step), which is common for these
practical realizations. In case the practical costs of generating $Z_i$ are approximately
constant, one can take $C = 1$. A random $C$ can be e.g. the internal duration time of a stochastic
process from which $Z$ is computed, like its hitting time of some set. For instance, when pricing
knock-out barrier options in computational finance using the MC method [19, 30], as such $C$
one can typically take the minimum of the hitting time of the barrier by the asset and the expiry
date of the option. We define a mean theoretical cost $c = \mathbb{E}_{\mathbb{Q}_1}(C)$ and a theoretical inefficiency
constant $\mathrm{ic} = c\,\mathrm{var}$ (whenever this product makes sense, i.e. when we do not multiply zero by
infinity in it), and the practical ones $\dot{c} = \mathbb{E}_{\mathbb{Q}_1}(\dot{C}) = p_{\dot{C}}\, c$ and $\dot{\mathrm{ic}} = \dot{c}\,\mathrm{var} = p_{\dot{C}}\, \mathrm{ic}$. For $c$ and var finite,
practical inefficiency constants are reasonable measures of the inefficiency of MC procedures 
as above, i.e. higher such constants imply lower efficiency. The name inefficiency constant 
was coined in [6, 7], while in some other works such a constant was called a work-normalized 
variance [45]. However, the idea of using a reciprocal of a practical inefficiency constant to 
quantify the efficiency of MC methods was conceived much earlier, see [23] for a historical 
review. Glynn and Whitt [23] proposed more general criteria for quantifying the asymptotic
efficiency of simulation estimators using asymptotic efficiency rates and values and the above 
practical inefficiency constant is equal to the reciprocal of their efficiency value in the special 
case of an MC method, in which the efficiency rate equals $\frac{1}{2}$. See [20], Section 10 of Chapter 3
in [5], or Section 1.1.3 in [19] for accessible descriptions of their approach in the special case 
of MC methods. 
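The product structure of the inefficiency constant can be made concrete with a short sketch that estimates it from replicates of the pair of value and cost. The helper name and the two small data sets are hypothetical.

```python
def inefficiency_constant(replicates):
    """Estimate ic = c * var from replicates (z_i, c_i) of (Z, C).

    Uses the sample mean of the costs and the unbiased sample variance
    of the values, so at least two replicates are required.
    """
    n = len(replicates)
    if n < 2:
        raise ValueError("need at least two replicates")
    z_mean = sum(z for z, _ in replicates) / n
    c_mean = sum(c for _, c in replicates) / n
    var = sum((z - z_mean) ** 2 for z, _ in replicates) / (n - 1)
    return c_mean * var


# Two hypothetical procedures estimating the same quantity.
cheap_noisy = [(1.0, 1.0), (2.0, 1.0), (3.0, 2.0), (4.0, 2.0)]
costly_precise = [(2.0, 4.0), (2.5, 4.0), (3.0, 4.0), (2.5, 4.0)]
print(inefficiency_constant(cheap_noisy))     # 1.5 * 5/3
print(inefficiency_constant(costly_precise))  # 4.0 * 1/6
```

In this toy comparison the costlier procedure has the lower estimated inefficiency constant, because its variance is smaller by more than its cost is larger.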


Further on in this chapter we provide some interpretations of inefficiency constants, both 
from the literature and new ones, justifying their utility for quantifying the inefficiency of MC 
procedures. The theorems introduced in the process will be frequently used further on in 
this work. We focus on theoretical inefficiency constants (often dropping further the word 
theoretical), but analogous interpretations hold also for the practical ones. 


The following interpretation of inefficiency constants was given in Section 2.6 in [7]. The ratio 
of positive finite inefficiency constants ic of different sequences of MC procedures as above 
(indexed by the numbers n of replicates used in them) is equal to the limit of ratios of their 
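mean costs $n_\epsilon c$ corresponding to the minimum numbers of replicates $n_\epsilon = \lceil \mathrm{var}/\epsilon \rceil$ needed to
reduce the variances $\frac{\mathrm{var}}{n}$ of the MC averages $\hat{\alpha}_n$ below a given threshold $\epsilon > 0$, for $\epsilon \to 0$.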

Consider a function $f : \mathbb{R}_+^2 \to \mathbb{R}_+$ such that for each $x, y \in \mathbb{R}_+$, $f(x, y) = f(y, x)$, and for each
$a \in \mathbb{R}_+$, $a f(x, y) = f(ax, ay)$. Let $g : \mathbb{R}_+^2 \to [0, \infty)$ be such that $g(x, y) = \frac{|x - y|}{f(x, y)}$, so that
$g(x, y) = g(y, x)$ and $g(ax, ay) = g(x, y)$, $a \in \mathbb{R}_+$. For instance, $f(x, y)$ can be equal to $\max(x, y)$, $\min(x, y)$,
or $\frac{x + y}{2}$, in which case $g(x, y)$ can be interpreted as the relative difference of $x$ and $y$. For some
$\delta \geq 0$, we say that $x, y \in \mathbb{R}_+$ are $\delta$-approximately equal, which we denote as $x \approx_\delta y$, if $g(x, y) \leq \delta$.
Note that $x \approx_0 y$ implies that $x = y$. The below simple interpretations of inefficiency constants
were given in Sections 1.9 and 2.6 of [7] in the special case of $f = \min$ as above. For two MC
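procedures for estimating $\alpha$, one like above using $n$ replicates and an analogous primed one,
assuming that $\mathrm{ic}, \mathrm{ic}' \in \mathbb{R}_+$, from an easy calculation we have

$$g\left(\frac{\mathrm{var}/n}{\mathrm{var}'/n'}, \frac{\mathrm{ic}}{\mathrm{ic}'}\right) = g(nc, n'c') \qquad (2.1)$$

and

$$g\left(\frac{nc}{n'c'}, \frac{\mathrm{ic}}{\mathrm{ic}'}\right) = g\left(\frac{\mathrm{var}}{n}, \frac{\mathrm{var}'}{n'}\right). \qquad (2.2)$$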


In particular, the ratio of positive finite inefficiency constants of these procedures is $\delta$-approximately
equal to the ratio of the variances of their respective MC averages for $\delta$-approximately
equal respective mean total costs and it is also $\delta$-approximately equal to the
ratio of their average costs for $\delta$-approximately equal variances of their MC averages.
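The two interpretations above can be checked numerically. The sketch below assumes that the relative difference has the form $g(x, y) = |x - y| / f(x, y)$ for a symmetric, positively homogeneous $f$; the numbers describing the two procedures are hypothetical.

```python
def relative_difference(x, y, f=min):
    """g(x, y) = |x - y| / f(x, y) for a symmetric, positively homogeneous f."""
    return abs(x - y) / f(x, y)


def mean2(x, y):
    """Arithmetic mean of two numbers, one admissible choice of f."""
    return 0.5 * (x + y)


# Hypothetical procedures: (replicates n, mean cost c, variance var).
n1, c1, var1 = 1_000, 2.0, 0.5
n2, c2, var2 = 1_500, 1.4, 0.6
ic1, ic2 = c1 * var1, c2 * var2

for f in (min, max, mean2):
    # Ratio of inefficiency constants vs. ratio of variances of the MC averages,
    # compared with the relative difference of the mean total costs.
    lhs1 = relative_difference(ic1 / ic2, (var1 / n1) / (var2 / n2), f)
    rhs1 = relative_difference(n1 * c1, n2 * c2, f)
    # Ratio of inefficiency constants vs. ratio of mean total costs,
    # compared with the relative difference of the variances of the MC averages.
    lhs2 = relative_difference(ic1 / ic2, (n1 * c1) / (n2 * c2), f)
    rhs2 = relative_difference(var1 / n1, var2 / n2, f)
    print(f.__name__, lhs1, rhs1, lhs2, rhs2)
```

Each printed pair agrees up to rounding, which is a consequence of the scale invariance $g(ax, ay) = g(x, y)$.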


Let $(Z_i, C_i)$, $i \in \mathbb{N}_+$, be independent replicates of $(Z, C)$ under $\mathbb{Q}_1$. Before providing further
interpretations of inefficiency constants, let us recall some basic facts about MC procedures
as above. From the strong law of large numbers (SLLN), for $\hat{\alpha}_n = \frac{1}{n} \sum_{i=1}^{n} Z_i$ it holds a.s.
$\lim_{n \to \infty} \hat{\alpha}_n = \alpha$, and if $\mathrm{var} < \infty$, then from the central limit theorem (CLT),
$\sqrt{n}(\hat{\alpha}_n - \alpha) \Rightarrow \mathcal{N}(0, \mathrm{var})$. Consider the following sample variance estimators

$$\widehat{\mathrm{var}}_n = \frac{1}{n - 1} \sum_{i=1}^{n} (Z_i - \hat{\alpha}_n)^2, \quad n \in \mathbb{N}_2. \qquad (2.3)$$


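If $\mathrm{var} < \infty$, then from the SLLN a.s. $\lim_{n \to \infty} \widehat{\mathrm{var}}_n = \mathrm{var}$, and if further $\mathrm{var} > 0$, then from Slutsky's
lemma (see e.g. Lemma 2.8 in [55])

$$\sqrt{\frac{n}{\widehat{\mathrm{var}}_n}}\, (\hat{\alpha}_n - \alpha) \Rightarrow \mathcal{N}(0, 1), \qquad (2.4)$$

which can be used to construct asymptotic confidence intervals for $\alpha$, as discussed e.g. in
Chapter 3, Section 1 in [5].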

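For $n \in \mathbb{N}_+$, let $\hat{c}_n = \frac{1}{n} \sum_{i=1}^{n} C_i$ and for $n \in \mathbb{N}_2$, let $\widehat{\mathrm{ic}}_n = \hat{c}_n \widehat{\mathrm{var}}_n$. Assuming that $c, \mathrm{var} < \infty$, from
the SLLN, a.s. $\lim_{n \to \infty} \hat{c}_n = c$ and $\lim_{n \to \infty} \widehat{\mathrm{ic}}_n = \mathrm{ic}$. Let $S_n = \sum_{i=1}^{n} C_i$, $n \in \mathbb{N}$ (in particular $S_0 = 0$),
so that $S_n$ is the cost of generating the first $n$ replicates of $Z$. For $t \in \mathbb{R}_+$, consider

$$N_t = \sup\{n \in \mathbb{N} : S_n \leq t\}, \qquad (2.5)$$

or

$$N_t = \inf\{n \in \mathbb{N} : S_n \geq t\}. \qquad (2.6)$$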

The above defined $N_t$ are reasonable choices of the numbers of simulations to perform if we
want to spend an approximate total budget $t$ (like e.g. some internal simulation time) on the
whole MC procedure. Definition (2.5) ensures that we do not exceed the budget $t$. Under
definition (2.6) we let ourselves finish the last computation started before the budget $t$ is
exceeded and thus we do not waste the computational effort already invested in it. Note that
under (2.6) we have $N_t > 0$, $t \in \mathbb{R}_+$, which does not need to be the case under (2.5). If $C < \infty$,
$\mathbb{Q}_1$ a.s., then a.s. $C_i < \infty$, $i \in \mathbb{N}_+$, and thus under both definitions a.s.

$$\lim_{t \to \infty} N_t = \infty. \qquad (2.7)$$
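The two budget-driven choices of the number of simulations can be sketched as follows, assuming non-strict inequalities in both definitions (the largest $n$ with $S_n \leq t$, and the smallest $n$ with $S_n \geq t$) and a finite list of already generated costs; the function names are hypothetical.

```python
from itertools import accumulate


def replicates_within_budget(costs, t):
    """N_t = sup{n : S_n <= t}, taken over the costs supplied."""
    partial_sums = [0.0, *accumulate(costs)]  # S_0 = 0, S_1, S_2, ...
    return max(n for n, s in enumerate(partial_sums) if s <= t)


def replicates_until_budget(costs, t):
    """N_t = inf{n : S_n >= t}; None if the supplied costs never reach t."""
    partial_sums = [0.0, *accumulate(costs)]
    return next((n for n, s in enumerate(partial_sums) if s >= t), None)


costs = [1.0, 2.0, 3.0]  # partial sums 0, 1, 3, 6
print(replicates_within_budget(costs, 2.5), replicates_until_budget(costs, 2.5))
print(replicates_within_budget(costs, 3.0), replicates_until_budget(costs, 3.0))
```

For a positive budget the first function can return zero (no replicate fits), while the second always returns at least one whenever the budget is reached.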

For some subset $A$ of some set $D$ we denote by $\mathbb{1}_A$ or $\mathbb{1}(A)$ the indicator function of $A$, i.e. a
function equal to one on $A$ and to zero on $D \setminus A$. For a real-valued random variable $Y$ we denote
$Y_+ = Y \mathbb{1}(Y > 0)$ and $Y_- = -Y \mathbb{1}(Y < 0)$. We have the following well-known slight generalization
of the ordinary SLLN (see the corollary on page 292 in [8]).

Theorem 1. If an $\mathbb{R}$-valued random variable $Y$ is such that $\mathbb{E}(Y_-) < \infty$, then for $Y_1, Y_2, \ldots$, i.i.d.
$\sim Y$, a.s.

$$\frac{1}{n} \sum_{i=1}^{n} Y_i \to \mathbb{E}(Y) \in \mathbb{R} \cup \{\infty\}. \qquad (2.8)$$

Let $c > 0$ (in particular we can have $c = \infty$). Then, from the above theorem a.s. $\lim_{n \to \infty} \frac{S_n}{n} = c$
and thus

$$\lim_{n \to \infty} S_n = \infty, \qquad (2.9)$$

so that under both definitions a.s.

$$N_t < \infty, \quad t > 0. \qquad (2.10)$$

From renewal theory (see Theorem 5.5.2 in [11]), under definition (2.5) we have a.s.

$$\lim_{t \to \infty} \frac{N_t}{t} = \frac{1}{c}. \qquad (2.11)$$

Since, marking $N_t$ given by (2.6) with a prime, we have $N_t \leq N_t' \leq N_t + 1$, (2.11) holds also when
using definition (2.6).

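Let us further consider general $\mathbb{N} \cup \{\infty\}$-valued random variables $N_t$, $t \in \mathbb{R}_+$. Let $m \in \mathbb{N}_+$ and
$Y$ be an $\mathbb{R}^m$-valued random vector such that $\mathbb{E}(Y_i^2) < \infty$, $i = 1, \ldots, m$, with mean $\mu = \mathbb{E}(Y)$ and
covariance matrix $W = \mathbb{E}((Y - \mu)(Y - \mu)^T)$. Let $X_1, X_2, \ldots$ be i.i.d. $\sim Y$. Let $\hat{\mu}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$,
$n \in \mathbb{N}_+$. For the string $A$ substituted by each of the strings $\alpha$, var, ic, $c$, and $\mu$, for $p = 2$ for $A$
substituted by var or ic and for $p = 1$ otherwise, consider an estimator $\hat{A}_t$ of $A$ corresponding
to the total budget $t \in \mathbb{R}_+$ and with an initial value $A_0$, where $A_0 \in \mathbb{R}^m$ for $A$ substituted by $\mu$
and $A_0 \in \mathbb{R}$ otherwise, defined as follows

$$\hat{A}_t = \hat{A}_{N_t} \mathbb{1}(N_t \in \mathbb{N}_p) + A_0 \mathbb{1}(N_t \notin \mathbb{N}_p). \qquad (2.12)$$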

We shall need the following trivial remark. 

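Remark 2. For each $k \in \mathbb{N}_+$, let $\tau_k$ be an a.s. $\mathbb{N}$-valued random variable (i.e. $\mathbb{P}(\tau_k \in \mathbb{N}) = 1$) and
let a.s. $\lim_{k \to \infty} \tau_k = \infty$. Let further $a_k$, $k \in \mathbb{N}$, be random variables such that a.s. $\lim_{k \to \infty} a_k = a$.
Then, a.s. $\lim_{k \to \infty} \mathbb{1}(\tau_k \in \mathbb{N})\, a_{\tau_k} = a$.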

When we have a.s. (2.7), (2.10), and for some $A$ as above, a.s. $\lim_{n \to \infty} \hat{A}_n = A$, then from Remark
2, a.s.

$$\lim_{t \to \infty} \hat{A}_t = A. \qquad (2.13)$$

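Lemma 3. Let $a_n$, $n \in \mathbb{N}_+$, be $\mathbb{N}_+$-valued random variables such that for some $t_n \in \mathbb{R}_+$, $n \in \mathbb{N}_+$,
such that $\lim_{n \to \infty} t_n = \infty$, and for some $b \in \mathbb{R}_+$, we have $\frac{a_n}{t_n} \overset{p}{\to} b$. Then,

$$\sqrt{a_n}\, (\hat{\mu}_{a_n} - \mu) \Rightarrow \mathcal{N}(0, W). \qquad (2.14)$$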


Proof. Using the Cramér-Wold device (see page 16 in [55]) it is sufficient to consider the case of
$m = 1$, which let us assume. For $W = 0$ we have a.s. $X_i = \mu$, $i \in \mathbb{N}_+$, so that the thesis is obvious.
The general case with $W > 0$ can be easily inferred from the special case in which $\mu = 0$ and
$W = 1$, which can be proved analogously as Theorem 7.3.2 in [11]. □
Consider the following condition (which for $c \in \mathbb{R}_+$ follows e.g. from (2.11) holding a.s.).

Condition 4. It holds $c \in \mathbb{R}_+$ and

$$\frac{N_t}{t} \overset{p}{\to} \frac{1}{c}. \qquad (2.15)$$

For $a, b \in \mathbb{R}$, by $a \wedge b$ we denote their minimum and by $a \vee b$ their maximum.

Theorem 5. Under Condition 4 we have

$$\sqrt{t}\, (\hat{\mu}_t - \mu) \Rightarrow \mathcal{N}(0, cW). \qquad (2.16)$$


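Proof. For each $t \in \mathbb{R}_+$, let $M_t = (\mathbb{1}(N_t \neq \infty) N_t) \vee 1$, which is an $\mathbb{N}_+$-valued random variable,
equal to $N_t$ when $N_t \in \mathbb{N}_+$. From Condition 4, it holds

$$\lim_{t \to \infty} \mathbb{P}(M_t = N_t \in \mathbb{N}_+) = 1 \qquad (2.17)$$

and thus

$$\frac{M_t}{t} \overset{p}{\to} \frac{1}{c}. \qquad (2.18)$$

Thus, from Lemma 3

$$\tilde{R}_t := \sqrt{M_t}\, (\hat{\mu}_{M_t} - \mu) \Rightarrow \mathcal{N}(0, W). \qquad (2.19)$$

Let $R_t = \mathbb{1}(N_t \in \mathbb{N}_+) \tilde{R}_t = \mathbb{1}(N_t \in \mathbb{N}_+) \sqrt{N_t}\, (\hat{\mu}_t - \mu)$. From (2.17), $R_t - \tilde{R}_t \overset{p}{\to} 0$. Therefore, from
(2.19) and Slutsky's lemma, $R_t \Rightarrow \mathcal{N}(0, W)$, and thus

$$\sqrt{c}\, R_t \Rightarrow \mathcal{N}(0, cW). \qquad (2.20)$$

Let $\tilde{G}_t = \sqrt{t}\, (\hat{\mu}_t - \mu)$ and $G_t = \mathbb{1}(N_t \in \mathbb{N}_+) \tilde{G}_t$. Then, $G_t - \tilde{G}_t \overset{p}{\to} 0$, so that to prove (2.16) it is
sufficient to prove that

$$G_t \Rightarrow \mathcal{N}(0, cW). \qquad (2.21)$$

From (2.17), the continuous mapping theorem, and Slutsky's lemma, $S_t := \mathbb{1}(N_t \in \mathbb{N}_+) \sqrt{\frac{t}{c N_t}} \overset{p}{\to} 1$.
Thus, (2.21) follows from (2.20) and the fact that from Slutsky's lemma

$$\sqrt{c}\, R_t - G_t = \sqrt{c}\, R_t (1 - S_t) \overset{p}{\to} 0. \qquad (2.22)$$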


□ 
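The statement of Theorem 5 can be probed by simulation in a scalar toy model. The sketch below assumes exponentially distributed costs with mean $c = 2$, independent normal replicates with mean 1 and variance $W = 9$, and the budget rule of definition (2.5); these choices and the function name are hypothetical.

```python
import math
import random
import statistics


def budget_mean(t, rng, fallback=0.0):
    """Mean of the first N_t replicates, with N_t = sup{n : S_n <= t}.

    Hypothetical model: costs C_i ~ Exp(rate 0.5), so c = 2, and
    replicates ~ N(1, 3**2), so mu = 1 and W = 9.
    Returns `fallback` (the initial value) if N_t = 0.
    """
    spent, total, n = 0.0, 0.0, 0
    while True:
        cost = rng.expovariate(0.5)
        if spent + cost > t:
            break
        spent += cost
        total += rng.gauss(1.0, 3.0)
        n += 1
    return total / n if n >= 1 else fallback


rng = random.Random(4)
t, mu, c, W = 400.0, 1.0, 2.0, 9.0
scaled = [math.sqrt(t) * (budget_mean(t, rng) - mu) for _ in range(2_000)]
scaled_var = statistics.variance(scaled)
print(scaled_var, c * W)  # the two values should be of similar size
```

The empirical variance of the scaled errors is expected to be close to $cW = 18$, in line with the limiting normal distribution in (2.16).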


In the below theorem and remark we extend the interpretations of inefficiency constants 
provided at the beginning of Section 10, Chapter 3 in [5] (see also [20] and Example 1 in [23]). 

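Theorem 6. If $\mathrm{var} < \infty$ and Condition 4 holds, then

$$\sqrt{t}\, (\hat{\alpha}_t - \alpha) \Rightarrow \mathcal{N}(0, \mathrm{ic}). \qquad (2.23)$$

If we further have $\mathrm{var} > 0$ and a.s. (2.7) and (2.10), then

$$\sqrt{\frac{t}{\widehat{\mathrm{ic}}_t}}\, (\hat{\alpha}_t - \alpha) \Rightarrow \mathcal{N}(0, 1). \qquad (2.24)$$

Proof. Formula (2.23) follows immediately from Theorem 5 and (2.24) follows from (2.23),
(2.13) holding a.s. for $A = \mathrm{ic}$, and Slutsky's lemma. □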

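Remark 7. Let $X \sim \mathcal{N}(0, 1)$ and let for $\beta \in (0, 1)$, $z_\beta$ be the $\beta$-quantile of the normal distribution,
i.e. $\mathbb{P}(X \leq z_\beta) = \beta$. Let $\gamma \in (0, 1)$ and $p_\gamma = z_{1 - \gamma/2}$, so that $\mathbb{P}(|X| \leq p_\gamma) = 1 - \gamma$. Assuming (2.24), for
the random interval $I_{\gamma, t} = \left(\hat{\alpha}_t - p_\gamma \sqrt{\widehat{\mathrm{ic}}_t / t},\ \hat{\alpha}_t + p_\gamma \sqrt{\widehat{\mathrm{ic}}_t / t}\right)$ we have

$$\lim_{t \to \infty} \mathbb{P}(\alpha \in I_{\gamma, t}) = \mathbb{P}(|X| \leq p_\gamma) = 1 - \gamma, \qquad (2.25)$$

i.e. $I_{\gamma, t}$ is an asymptotic $1 - \gamma$ confidence interval for $\alpha$. It follows that $\widehat{\mathrm{ic}}_t$ and $\hat{\alpha}_t$ can play
the same role when constructing the asymptotic confidence intervals for $\alpha$ for $t \to \infty$, as $\widehat{\mathrm{var}}_n$
and $\hat{\alpha}_n$ do for $n \to \infty$ as discussed below (2.4). For $C = 1$, both approaches to constructing the
asymptotic confidence intervals are equivalent.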
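A budget-based confidence interval of the kind described in Remark 7 can be sketched as follows. The sketch assumes the stopping rule of definition (2.6), strictly positive costs, and a hypothetical replicate $Z = U^2$ with cost $C = 1 + U$ for $U$ uniform on $(0, 1)$; the function names are illustrative.

```python
import math
import random
from statistics import NormalDist


def budget_confidence_interval(simulate, t, gamma):
    """Asymptotic 1 - gamma confidence interval from a total cost budget t.

    `simulate()` returns one replicate (z, c) with c > 0. Replicates are
    generated until the accumulated cost reaches t, i.e. N_t = inf{n : S_n >= t}.
    """
    zs, cost_sum = [], 0.0
    while cost_sum < t:
        z, c = simulate()
        zs.append(z)
        cost_sum += c
    n = len(zs)
    if n < 2:
        raise ValueError("budget too small to estimate the variance")
    alpha_hat = sum(zs) / n
    var_hat = sum((z - alpha_hat) ** 2 for z in zs) / (n - 1)
    ic_hat = (cost_sum / n) * var_hat  # mean cost times sample variance
    p_gamma = NormalDist().inv_cdf(1.0 - gamma / 2.0)
    half_width = p_gamma * math.sqrt(ic_hat / t)
    return alpha_hat, alpha_hat - half_width, alpha_hat + half_width


rng = random.Random(3)


def simulate():
    """Hypothetical replicate: Z = U**2 with cost C = 1 + U, U ~ Uniform(0, 1)."""
    u = rng.random()
    return u * u, 1.0 + u


estimate, lower, upper = budget_confidence_interval(simulate, t=30_000.0, gamma=0.05)
print(estimate, lower, upper)
```

The half-width scales like $\sqrt{\mathrm{ic}/t}$, so halving the inefficiency constant has the same effect on the interval as doubling the budget.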
3 Importance sampling


3.1 Background on densities 

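Consider a measurable space $\mathcal{S} = (D, \mathcal{D})$, let $\mu_1$ and $\mu_2$ be measures on $\mathcal{S}$, and let $A \in \mathcal{D}$. We
say that $\mu_1$ has a density $L$ (also called a Radon-Nikodym derivative) with respect to $\mu_2$ on $A$,
which we denote as $L = \left(\frac{d\mu_1}{d\mu_2}\right)_A$, if $L$ is a measurable function from $\mathcal{S}$ to $\mathcal{S}(\mathbb{R})$ such that for
each $B \in \mathcal{D}$, $\mu_1(A \cap B) = \int L\, \mathbb{1}(A \cap B)\, d\mu_2$. If $L = \left(\frac{d\mu_1}{d\mu_2}\right)_A$, then for each measurable function $f$
from $\mathcal{S}$ to $\mathcal{S}(\mathbb{R})$ such that $\mathbb{1}_A f$ is nonnegative or $\mu_1$-integrable, it holds

$$\int \mathbb{1}(A) f \, d\mu_1 = \int \mathbb{1}(A) f L \, d\mu_2. \qquad (3.1)$$

Such an $L$ is uniquely defined $\mu_2$ a.e. on $A$, i.e. for some $L' : \mathcal{S} \to \mathcal{S}(\mathbb{R})$ we also have $L' = \left(\frac{d\mu_1}{d\mu_2}\right)_A$
only if $L' = L$, $\mu_2$ a.e. on $A$ (i.e. if $\mu_2(\{L' = L\} \cap A) = \mu_2(A)$). Furthermore, such an $L$ is $\mu_2$ a.e.
nonnegative on $A$. We say that $\mu_1$ is absolutely continuous with respect to $\mu_2$ on $A$, which we
denote as $\mu_1 \ll_A \mu_2$, if for each $B \in \mathcal{D}$, from $\mu_2(A \cap B) = 0$ it follows that $\mu_1(A \cap B) = 0$. We say
that $\mu_1$ and $\mu_2$ are mutually absolutely continuous on $A$ if $\mu_1 \ll_A \mu_2$ and $\mu_2 \ll_A \mu_1$, which we
also denote as $\mu_1 \sim_A \mu_2$. If $L = \left(\frac{d\mu_1}{d\mu_2}\right)_A$ exists, then it holds $\mu_1 \ll_A \mu_2$. We say that a measure $\mu$
on $\mathcal{S}$ is $\sigma$-finite on $A$ if $A$ is a countable union of sets from $\mathcal{D}$ with finite $\mu$ measure. Note that
if $\mu$ is a probability distribution then it is $\sigma$-finite on $A$. From the Radon-Nikodym theorem, if
$\mu_1$ and $\mu_2$ are $\sigma$-finite on $A$ and $\mu_1 \ll_A \mu_2$, then $\left(\frac{d\mu_1}{d\mu_2}\right)_A$ exists.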


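Lemma 8. Let $L = \left(\frac{d\mu_1}{d\mu_2}\right)_A$. Then, $\mu_1 \sim_A \mu_2$ only if $\mu_2(\{L = 0\} \cap A) = 0$, in which case

$\frac{\mathbb{1}(L \neq 0)}{L} = \left(\frac{d\mu_2}{d\mu_1}\right)_A.$ (3.2)

Proof. If $\mu_2(\{L = 0\} \cap A) = 0$, then for $B \in \mathcal{D}$, from (3.1),

$\int \mathbb{1}(A \cap B)\, d\mu_2 = \int L \frac{\mathbb{1}(L \neq 0)}{L} \mathbb{1}(A \cap B)\, d\mu_2 = \int \frac{\mathbb{1}(L \neq 0)}{L} \mathbb{1}(A \cap B)\, d\mu_1,$ (3.3)

so that we have (3.2) and $\mu_1 \sim_A \mu_2$. On the other hand, since $\mu_1(\{L = 0\} \cap A) = \int L \mathbb{1}(\{L = 0\} \cap A)\, d\mu_2 = 0$, if $\mu_2(\{L = 0\} \cap A) > 0$ then we cannot have $\mu_2 \ll_A \mu_1$. □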

For $A = D$ we omit $A$ in the above notations, e.g. we write $\mu_1 \ll \mu_2$, $\mu_1 \sim \mu_2$, and $L = \frac{d\mu_1}{d\mu_2}$. We say that $q$ is a random condition on $\mathcal{S}$ if $\{x \in D : q(x)\} \in \mathcal{D}$. Often the event $\{x \in D : q(x)\}$ will be denoted simply as $\{q\}$ and we shall frequently write $q$ in the place of $\{q\}$ in various notations.


3.2 IS and zero- and optimal-variance IS distributions 

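If for some probability $\mathbb{Q}_2$ on $\mathcal{S}_1$, $\mathbb{Q}_1 \ll_{Z \neq 0} \mathbb{Q}_2$, then for $L = \left(\frac{d\mathbb{Q}_1}{d\mathbb{Q}_2}\right)_{Z \neq 0}$ we have $\alpha = \mathbb{E}_{\mathbb{Q}_1}(Z) = \mathbb{E}_{\mathbb{Q}_2}(ZL)$.

Importance sampling (IS) relies on estimating $\alpha$ by using in an MC method independent replicates of such an IS estimator $ZL$ under $\mathbb{Q}_2$. The variance of the IS estimator fulfills

$\mathrm{Var}_{\mathbb{Q}_2}(ZL) = \mathbb{E}_{\mathbb{Q}_2}((ZL)^2) - \alpha^2 = \mathbb{E}_{\mathbb{Q}_1}(Z^2 L) - \alpha^2.$ (3.4)

Condition 9. It holds $\mathbb{Q}_1 \ll_{Z \neq 0} \mathbb{Q}_2$ and for some $L = \left(\frac{d\mathbb{Q}_1}{d\mathbb{Q}_2}\right)_{Z \neq 0}$ we have $\mathrm{Var}_{\mathbb{Q}_2}(ZL) = 0$ or equivalently $\mathbb{Q}_2$ a.s. $ZL = \alpha$.

Theorem 10. Condition 9 holds only if it holds with ’for some’ replaced by ’for each’.

Proof. Let $L$ be as in Condition 9 and $L' = \left(\frac{d\mathbb{Q}_1}{d\mathbb{Q}_2}\right)_{Z \neq 0}$. Then, $L = L'$, $\mathbb{Q}_2$ a.s. on $Z \neq 0$ and $0 = ZL = ZL'$ on $Z = 0$. Thus, from $\mathbb{Q}_2$ a.s. $ZL = \alpha$ it also holds $\mathbb{Q}_2$ a.s. $ZL' = \alpha$. □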


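Condition 11. It holds $\mathbb{Q}_1(Z \neq 0) > 0$.

Condition 12. It holds $\mathbb{Q}_1(Z \neq 0) > 0$ and either $\mathbb{Q}_1$ a.s. $Z \ge 0$ or $\mathbb{Q}_1$ a.s. $Z \le 0$.

Theorem 13. If Condition 12 holds, then for a probability $\mathbb{Q}^*$ given by

$\frac{d\mathbb{Q}^*}{d\mathbb{Q}_1} = \frac{Z}{\alpha},$ (3.5)

Condition 9 holds for $\mathbb{Q}_2 = \mathbb{Q}^*$. Furthermore, $\mathbb{Q}^*(Z \neq 0) = 1$ and $\mathbb{Q}^* \sim_{Z \neq 0} \mathbb{Q}_1$ with

$L^* := \mathbb{1}(Z \neq 0)\frac{\alpha}{Z} = \left(\frac{d\mathbb{Q}_1}{d\mathbb{Q}^*}\right)_{Z \neq 0}.$ (3.6)

Proof. Condition 12 implies that $\mathbb{Q}^*$ is well-defined. Furthermore, $\mathbb{Q}^*(Z \neq 0) = \mathbb{E}_{\mathbb{Q}_1}\left(\mathbb{1}(Z \neq 0)\frac{Z}{\alpha}\right) = 1$ and from Lemma 8 we have (3.6). In particular, $ZL^* = \alpha$, $\mathbb{Q}^*$ a.s., that is Condition 9 holds for $\mathbb{Q}_2 = \mathbb{Q}^*$. □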


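Lemma 14. Assuming Condition 11, if there exists a probability $\mathbb{Q}_2$ fulfilling Condition 9, then Condition 12 holds and such $\mathbb{Q}_2$ is equal to the probability $\mathbb{Q}^*$ as in Theorem 13.

Proof. Let conditions 9 and 11 hold. Then, $\mathbb{Q}_2$ a.s.

$\mathbb{1}(Z \neq 0)\frac{\alpha}{Z} = \mathbb{1}(Z \neq 0) L = \left(\frac{d\mathbb{Q}_1}{d\mathbb{Q}_2}\right)_{Z \neq 0}.$ (3.7)

Thus, from Condition 11, $\mathbb{E}_{\mathbb{Q}_2}\left(\mathbb{1}(Z \neq 0)\frac{\alpha}{Z}\right) = \mathbb{Q}_1(Z \neq 0) > 0$, which implies that $\alpha \neq 0$. From (3.7) and Lemma 8 we have $\mathbb{Q}_1 \sim_{Z \neq 0} \mathbb{Q}_2$ and $\frac{Z}{\alpha} = \left(\frac{d\mathbb{Q}_2}{d\mathbb{Q}_1}\right)_{Z \neq 0}$. Thus, $\mathbb{Q}_2(Z \neq 0) = \mathbb{E}_{\mathbb{Q}_1}\left(\mathbb{1}(Z \neq 0)\frac{Z}{\alpha}\right) = 1$ and $\frac{d\mathbb{Q}_2}{d\mathbb{Q}_1} = \frac{Z}{\alpha}$. In particular, $\mathbb{Q}_1$ a.s. $Z\,\mathrm{sgn}(\alpha) \ge 0$ and thus Condition 12 holds and $\mathbb{Q}_2 = \mathbb{Q}^*$. □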


Theorem 15. If Condition 12 holds, then the probability $\mathbb{Q}^*$ as in Theorem 13 is the unique probability $\mathbb{Q}_2$ for which Condition 9 holds.

Proof. Since Condition 12 implies Condition 11 and Theorem 13 implies the existence of $\mathbb{Q}_2$ fulfilling Condition 9, from Lemma 14, $\mathbb{Q}_2 = \mathbb{Q}^*$. □


We shall call the probability $\mathbb{Q}^*$ as in Theorem 13 the zero-variance IS distribution. Assuming that $L = \left(\frac{d\mathbb{Q}_1}{d\mathbb{Q}_2}\right)_{Z \neq 0}$, from (3.4),

$\mathrm{Var}_{\mathbb{Q}_2}(|ZL|) = \mathbb{E}_{\mathbb{Q}_1}(Z^2 L) - (\mathbb{E}_{\mathbb{Q}_1}(|Z|))^2,$ (3.8)

and thus

$\mathrm{Var}_{\mathbb{Q}_2}(ZL) = \mathrm{Var}_{\mathbb{Q}_2}(|ZL|) + (\mathbb{E}_{\mathbb{Q}_1}(|Z|))^2 - \alpha^2 \ge (\mathbb{E}_{\mathbb{Q}_1}(|Z|))^2 - \alpha^2,$ (3.9)

with equality holding only if Condition 9 holds for $|Z|$ (i.e. for $Z$ replaced by $|Z|$ and in particular for $\alpha$ replaced by $\mathbb{E}_{\mathbb{Q}_1}(|Z|)$). Let Condition 11 hold. Then, Condition 12 holds for $|Z|$ and from Theorem 15, $\mathbb{Q}^*$ as in Theorem 13 but for $Z$ replaced by $|Z|$, i.e. such that

$\frac{d\mathbb{Q}^*}{d\mathbb{Q}_1} = \frac{|Z|}{\mathbb{E}_{\mathbb{Q}_1}(|Z|)},$ (3.10)

is the unique probability $\mathbb{Q}_2$ for which Condition 9 holds for $|Z|$. The fact that Condition 9 holds for $|Z|$ for such a $\mathbb{Q}^*$ is well-known, see e.g. Theorem 1.2 in Chapter V in [5], but the uniqueness result is to our knowledge new. Furthermore, we have

$L^* := \mathbb{1}(Z \neq 0)\frac{\mathbb{E}_{\mathbb{Q}_1}(|Z|)}{|Z|} = \left(\frac{d\mathbb{Q}_1}{d\mathbb{Q}^*}\right)_{Z \neq 0}.$ (3.11)

Note that from Condition 9 holding for $|Z|$ and (3.8), $\mathbb{Q}^*$ a.s. (or equivalently $\mathbb{Q}_1$ a.s. on $Z \neq 0$)

$|Z| L^* = \mathbb{E}_{\mathbb{Q}_1}(|Z|) = \sqrt{\mathbb{E}_{\mathbb{Q}_1}(Z^2 L^*)}.$ (3.12)

We call such a $\mathbb{Q}^*$ the optimal-variance IS distribution. Under Condition 12 the optimal-variance IS distribution is also the zero-variance one. In some places in the literature our optimal-variance IS distribution is called simply the optimal IS distribution (see e.g. page 127 in [5]). However, as argued in Chapter 2, it may be preferable to minimize the inefficiency constant rather than the variance, and the optimal-variance IS distribution does not need to lead to the lowest inefficiency constant achievable via IS, so calling it optimal may be misleading.
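
The statements of this section can be checked by hand on a finite sample space, where all expectations are finite sums. The following sketch is an illustration only: the probabilities `q1` and the values of `z` are made-up numbers, the function names are hypothetical, and it is assumed that `q2` charges every point at which `z` is nonzero.

```python
def is_variance(q1, q2, z):
    """Exact variance of the IS estimator Z*L under q2, with L = q1/q2 on {Z != 0}.

    Assumes q2 > 0 wherever z != 0 and q1 > 0 (absolute continuity on {Z != 0}).
    """
    alpha = sum(p * v for p, v in zip(q1, z))
    second_moment = sum(
        p2 * (v * p1 / p2) ** 2
        for p1, p2, v in zip(q1, q2, z)
        if v != 0 and p2 > 0
    )
    return second_moment - alpha ** 2


def optimal_variance_distribution(q1, z):
    """Probabilities proportional to |Z| * q1, as in (3.10)."""
    norm = sum(p * abs(v) for p, v in zip(q1, z))
    return [p * abs(v) / norm for p, v in zip(q1, z)]


q1 = [0.1, 0.2, 0.3, 0.4]          # illustrative initial distribution
z_nonneg = [0.0, 2.0, 1.0, 5.0]    # Condition 12 holds
z_mixed = [0.0, -2.0, 1.0, 5.0]    # only Condition 11 holds

var_zero = is_variance(q1, optimal_variance_distribution(q1, z_nonneg), z_nonneg)
var_opt = is_variance(q1, optimal_variance_distribution(q1, z_mixed), z_mixed)
```

For the nonnegative variable the variance vanishes up to rounding, while for the variable with mixed signs it equals $(\mathbb{E}_{\mathbb{Q}_1}(|Z|))^2 - \alpha^2$, the lower bound in (3.9).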


3.3 Mean cost and inefficiency constant in IS 

Let $L = \left(\frac{d\mathbb{Q}_1}{d\mathbb{Q}_2}\right)_{Z \neq 0}$ and let $C$ be a nonnegative (theoretical) cost variable on $\mathcal{S}_1$ for computing replicates of $ZL$ under $\mathbb{Q}_2$. We shall consider $C$ to be the same for different $\mathbb{Q}_2$ under consideration. The mean cost under $\mathbb{Q}_2$ is $\mathbb{E}_{\mathbb{Q}_2}(C)$ and such a (theoretical) inefficiency constant is



$\mathrm{Var}_{\mathbb{Q}_2}(ZL)\,\mathbb{E}_{\mathbb{Q}_2}(C)$ (3.13)

(assuming that it is well-defined). 

Note that if the zero-variance IS distribution Q* exists and the mean cost Eq* (C) is finite, then 
the inefficiency constant under Q* is zero. 

The theorem below provides an intuition why in our numerical experiments in Chapter 10, for some $a \in [0,\infty)$ and $s \in \mathbb{R}_+$, for a nonincreasing function $f(x) = \mathbb{1}(x < s) + a$ and a strictly decreasing one $f(x) = \exp(-sx)$, and for $Z = f(C)$, we observed mean cost reduction after changing the initial distribution to one in a sense closer to the respective zero-variance IS distribution $\mathbb{Q}^*$.

Theorem 16. Let $f: \mathcal{S}([0,\infty)) \to \mathcal{S}([0,\infty))$, $Z = f(C)$, $\mathbb{E}_{\mathbb{Q}_1}(Z) \in \mathbb{R}_+$, and $\mathbb{E}_{\mathbb{Q}_1}(C) < \infty$. Let $\mathbb{Q}^*$ be the zero-variance IS distribution.

1. If $f$ is nonincreasing, then

$\mathbb{E}_{\mathbb{Q}^*}(C) \le \mathbb{E}_{\mathbb{Q}_1}(C),$ (3.14)

and if further for some $0 \le x_1 < x_2 < \infty$ we have $f(x_1) > f(x_2)$, $\mathbb{Q}_1(C \in [0, x_1]) > 0$, and $\mathbb{Q}_1(C \in [x_2, \infty)) > 0$ (which is the case e.g. if $f$ is strictly decreasing and $C$ is not $\mathbb{Q}_1$ a.s. constant), then the inequality in (3.14) is strict.

2. If $f$ is nondecreasing, then

$\mathbb{E}_{\mathbb{Q}^*}(C) \ge \mathbb{E}_{\mathbb{Q}_1}(C),$ (3.15)

and if further for some $0 \le x_1 < x_2 < \infty$ we have $f(x_1) < f(x_2)$, $\mathbb{Q}_1(C \in [0, x_1]) > 0$, and $\mathbb{Q}_1(C \in [x_2, \infty)) > 0$, then the inequality in (3.15) is strict.

Proof. From (3.5) we have

$\mathbb{E}_{\mathbb{Q}^*}(C) = \frac{\mathbb{E}_{\mathbb{Q}_1}(f(C)C)}{\mathbb{E}_{\mathbb{Q}_1}(f(C))}.$ (3.16)

For $C_1$ and $C_2$ being independent replicates of $C$ under $\mathbb{Q}_1$, we have

$\mathbb{E}_{\mathbb{Q}_1}(f(C)C) - \mathbb{E}_{\mathbb{Q}_1}(C)\,\mathbb{E}_{\mathbb{Q}_1}(f(C)) = \frac{1}{2}\mathbb{E}\big((f(C_1) - f(C_2))(C_1 - C_2)\big),$ (3.17)

which is nonpositive if $f$ is nonincreasing and negative under the additional assumptions of point one, or nonnegative if $f$ is nondecreasing and positive under the additional assumptions of point two. From this and (3.16), the thesis easily follows. □
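
Formula (3.16) makes Theorem 16 easy to check for a cost variable with finitely many values. The sketch below is an illustration only: the cost values, the uniform probabilities, and the function name are assumptions made for the example.

```python
import math


def mean_cost_under_zero_variance(q1, costs, f):
    """E_{Q*}(C) = E_{Q1}(f(C) C) / E_{Q1}(f(C)) for Z = f(C), as in (3.16)."""
    num = sum(p * f(c) * c for p, c in zip(q1, costs))
    den = sum(p * f(c) for p, c in zip(q1, costs))
    return num / den


costs = [1.0, 2.0, 5.0, 10.0]       # illustrative values of C
q1 = [0.25, 0.25, 0.25, 0.25]       # illustrative distribution of C under Q_1
mean_cost_q1 = sum(p * c for p, c in zip(q1, costs))

cost_decreasing = mean_cost_under_zero_variance(q1, costs, lambda x: math.exp(-0.3 * x))
cost_increasing = mean_cost_under_zero_variance(q1, costs, lambda x: 1.0 + x)
cost_constant = mean_cost_under_zero_variance(q1, costs, lambda x: 1.0)
```

A strictly decreasing `f` lowers the mean cost under the zero-variance distribution, a strictly increasing one raises it, and a constant `f` leaves it unchanged.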
3.4 Parametric IS

For some nonempty set $A$, let us consider a family $\mathbb{Q}(b)$, $b \in A$, of probability distributions on $\mathcal{S}_1$. Typically, we shall assume that for some $l \in \mathbb{N}_+$

$A \in \mathcal{B}(\mathbb{R}^l).$ (3.18)

Consider a function $L: A \times \Omega_1 \to \mathbb{R}$, for which we denote $L(b) = L(b, \cdot)$, $b \in A$. If the following condition is fulfilled, then for each $b \in A$ one can perform IS using the IS distribution $\mathbb{Q}(b)$ and density $L(b)$ as in Section 3.2.

Condition 17. It holds $L(b) = \left(\frac{d\mathbb{Q}_1}{d\mathbb{Q}(b)}\right)_{Z \neq 0}$, $b \in A$.

For $x_1$ and $x_2$ being two $\sigma$-fields, measurable spaces, or measures, by $x_1 \otimes x_2$ we denote their product $\sigma$-field, measurable space, or measure respectively, while for $n \in \mathbb{N}_+$, by $x_1^n$ we mean such an $n$-fold product of $x_1$. The following conditions will be useful further on.

Condition 18. We have (3.18) and $L$ is measurable from $\mathcal{S}(A) \otimes \mathcal{S}_1$ to $\mathcal{S}(\mathbb{R})$.

Condition 19. We have (3.18) and a probability $\mathbb{P}_1$ on a measurable space $\mathcal{C}_1$ and $\xi: \mathcal{C}_1 \otimes \mathcal{S}(A) \to \mathcal{S}_1$ are such that for each $b \in A$

$\mathbb{Q}(b)(B) = \mathbb{P}_1(\xi(\cdot, b)^{-1}[B]), \quad B \in \mathcal{F}_1,$ (3.19)

or equivalently, for each random variable $X \sim \mathbb{P}_1$, $\xi(X, b) \sim \mathbb{Q}(b)$, $b \in A$.

Remark 20. Let conditions 17, 18, and 19 hold and let $b$ be some $A$-valued random variable, which can be e.g. some adaptively obtained IS parameter. Let $\beta_i \sim \mathbb{P}_1$, $i \in \mathbb{N}_+$, be i.i.d. and independent of $b$. Then, from Fubini’s theorem it follows that the random variables $\hat{\alpha}_n = \frac{1}{n}\sum_{i=1}^n (ZL(b))(\xi(\beta_i, b))$, $n \in \mathbb{N}_+$, are unbiased and strongly consistent estimators of $\alpha$, i.e. $\mathbb{E}(\hat{\alpha}_n) = \alpha$, $n \in \mathbb{N}_+$, and a.s. $\lim_{n\to\infty} \hat{\alpha}_n = \alpha$.
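
The estimator of Remark 20 can be sketched for a concrete family. Everything specific below is an assumption chosen for illustration and is not taken from this work: $\mathbb{Q}_1 = \mathcal{N}(0,1)$, $\mathbb{Q}(b) = \mathcal{N}(b,1)$, the map `xi(beta, b) = beta + b`, and $Z = \mathbb{1}(X > 3)$, for which $L(b)(x) = \exp(-bx + b^2/2)$.

```python
import math
import random


def xi(beta, b):
    """Reparametrization: beta ~ N(0, 1) is mapped to a draw from Q(b) = N(b, 1)."""
    return beta + b


def density_ratio(b, x):
    """L(b)(x) = dQ_1/dQ(b)(x) for Q_1 = N(0, 1) and Q(b) = N(b, 1)."""
    return math.exp(-b * x + 0.5 * b * b)


def is_estimate(b, n, threshold=3.0, seed=0):
    """Average of (Z L(b))(xi(beta_i, b)) over n i.i.d. draws beta_i ~ N(0, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = xi(rng.gauss(0.0, 1.0), b)
        z = 1.0 if x > threshold else 0.0
        total += z * density_ratio(b, x)
    return total / n
```

Calling `is_estimate(0.0, n)` gives the crude MC estimator, while a parameter near the threshold, e.g. `is_estimate(3.0, n)`, typically gives a much smaller variance for this rare-event probability.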

In the further sections we shall often deal with families of distributions and densities satisfying 
the following condition. 

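Condition 21. A set $B_1 \in \mathcal{F}_1$ is such that we have $\mathbb{Q}(b) \sim_{B_1} \mathbb{Q}_1$ and $L(b) = \left(\frac{d\mathbb{Q}_1}{d\mathbb{Q}(b)}\right)_{B_1}$, $b \in A$.

Let us formulate separately the special important case of the above condition.

Condition 22. Condition 21 holds for $B_1 = \Omega_1$, or equivalently $\mathbb{Q}(b) \sim \mathbb{Q}_1$ and $L(b) = \frac{d\mathbb{Q}_1}{d\mathbb{Q}(b)}$, $b \in A$.

The following condition will be useful to avoid technical problems, e.g. when dividing by $L$ or taking its logarithm.

Condition 23. It holds $L(b)(\omega) > 0$, $b \in A$, $\omega \in \Omega_1$.

Condition 24. Condition 21 holds, $\mathbb{Q}_1(\{Z \neq 0\} \setminus B_1) = 0$, and $\mathbb{Q}(b)(\{Z \neq 0\} \setminus B_1) = 0$, $b \in A$.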
Remark 25. Note that for $Z$ such that Condition 24 holds, we have $\mathbb{Q}(b) \sim_{\{Z \neq 0\} \cup B_1} \mathbb{Q}_1$, $b \in A$, and

$L(b) = \left(\frac{d\mathbb{Q}_1}{d\mathbb{Q}(b)}\right)_{\{Z \neq 0\} \cup B_1}, \quad b \in A,$ (3.20)

so that Condition 17 holds.

Definition 26. We say that $b^* \in A$ is a zero-variance (optimal-variance) IS parameter if Condition 12 (Condition 11) holds and $\mathbb{Q}(b^*)$ is the zero-variance (optimal-variance) IS distribution.

Note that in the literature the name optimal IS parameter is sometimes used for the parameter minimizing the variance $b \in A \to \mathrm{Var}_{\mathbb{Q}(b)}(ZL(b))$ of the IS estimator (see e.g. [35]), which may not be equal to an optimal-variance IS parameter in the sense of the above definition.

The below theorem characterizes the random variables Z as above for which there exists a 
zero-variance IS parameter, under some of the above conditions. 

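Theorem 27. Let us assume Condition 21. Then, Condition 24 holds and there exists a zero-variance IS parameter $b_1$ (for which we denote $\mathbb{Q}^* = \mathbb{Q}(b_1)$), only if for some $b_2 \in A$, $\mathbb{Q}(b_2)(B_1) = 1$ and for some $\beta \in \mathbb{R} \setminus \{0\}$,

$Z = \mathbb{1}_{B_1}\mathbb{1}(L(b_2) \neq 0)\frac{\beta}{L(b_2)}, \quad \mathbb{Q}_1 \text{ a.s.}$ (3.21)

Furthermore, in the latter case we have $\beta = \alpha$ and $\mathbb{Q}(b_2) = \mathbb{Q}^*$.

Proof. Let us first show the right implication. From Condition 24 and $\mathbb{Q}^* = \mathbb{Q}(b_1)$ it follows for $b_2 = b_1$ that $\mathbb{Q}(b_2)(B_1) = \mathbb{Q}^*(B_1) = \mathbb{Q}^*(\{Z \neq 0\} \cap B_1) = \mathbb{Q}^*(Z \neq 0) - \mathbb{Q}(b_2)(\{Z \neq 0\} \setminus B_1) = 1$. From (3.6), $\mathbb{Q}^*$ a.s. $L(b_2)Z = \mathbb{1}(Z \neq 0)\alpha$, which from $\mathbb{Q}^* \sim_{Z \neq 0} \mathbb{Q}_1$ holds also $\mathbb{Q}_1$ a.s. Thus, since from Condition 12 we have $\alpha \neq 0$, it holds $\mathbb{Q}_1$ a.s. that if $Z \neq 0$ then also $L(b_2) \neq 0$. Therefore, we have $\mathbb{Q}_1$ a.s. $Z = \mathbb{1}(Z \neq 0 \wedge L(b_2) \neq 0)\frac{\alpha}{L(b_2)}$. Thus, from Condition 24,

$Z = \mathbb{1}_{B_1}\frac{1}{L(b_2)}\mathbb{1}(Z \neq 0 \wedge L(b_2) \neq 0)\,\alpha, \quad \mathbb{Q}_1 \text{ a.s.}$ (3.22)

From Condition 21, $\mathbb{Q}^* \sim_{B_1} \mathbb{Q}_1$, and thus from Lemma 8 and (3.5), $0 = \mathbb{Q}_1(\{L(b_2) = 0\} \cap B_1) = \mathbb{Q}_1(\{Z = 0\} \cap B_1)$, so that from (3.22) and $\mathbb{Q}_1(Z \neq 0) > 0$ we have (3.21) only for $\beta = \alpha$.

For the left implication note that for $Z$ as in (3.21) Condition 24 holds. Furthermore, from Condition 21 and Lemma 8,

$\mathbb{Q}(b_2)(B_1 \cap \{L(b_2) \neq 0\}) = \mathbb{Q}(b_2)(B_1).$ (3.23)

From (3.21), (3.23), and $\mathbb{Q}(b_2)(B_1) = 1$

$\alpha = \mathbb{E}_{\mathbb{Q}(b_2)}(ZL(b_2)) = \beta\,\mathbb{Q}(b_2)(B_1 \cap \{L(b_2) \neq 0\}) = \beta,$ (3.24)

so that Condition 12 holds. We have

$\frac{d\mathbb{Q}(b_2)}{d\mathbb{Q}_1} = \mathbb{1}_{B_1}\frac{\mathbb{1}(L(b_2) \neq 0)}{L(b_2)} = \frac{Z}{\beta} = \frac{d\mathbb{Q}^*}{d\mathbb{Q}_1},$ (3.25)

where in the first equality we used Condition 21, $\mathbb{Q}(b_2)(B_1) = 1$, and Lemma 8, in the second (3.21), and in the last (3.24) and (3.5). □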


Remark 28. From the discussion in Section 3.2, the optimal-variance IS distribution for $Z$ is the zero-variance one for $|Z|$. Thus, from the above theorem for $Z$ replaced by $|Z|$ we receive a characterization of variables $Z$ for which there exists an optimal-variance IS parameter under certain assumptions.
4 The minimized functions and their estimators


4.1 The minimized functions 

For some nonempty set $A$, consider a family of probability distributions as in Section 3.4 for which Condition 17 holds. Assuming Condition 23 and that

$\mathbb{E}_{\mathbb{Q}_1}((Z\ln(L(b)))_-) < \infty, \quad b \in A,$ (4.1)

we define a cross-entropy (function) $\mathrm{ce}: A \to \mathbb{R} \cup \{\infty\}$ as

$\mathrm{ce}(b) = \mathbb{E}_{\mathbb{Q}_1}(Z\ln(L(b))), \quad b \in A$ (4.2)

(see the discussion in Chapter 1 regarding its name).


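Remark 29. Let us discuss how $\mathrm{ce}(b)$ is related to a certain $f$-divergence of the zero-variance IS distribution from $\mathbb{Q}(b)$. For some convex function $f: [0,\infty) \to \mathbb{R}$, the $f$-divergence $d(\mathbb{P}_1, \mathbb{P}_2)$ of a probability $\mathbb{P}_2$ from another one $\mathbb{P}_1$ such that $\mathbb{P}_2 \ll \mathbb{P}_1$ is given by the formula

$d(\mathbb{P}_1, \mathbb{P}_2) = \mathbb{E}_{\mathbb{P}_1}\left(f\left(\frac{d\mathbb{P}_2}{d\mathbb{P}_1}\right)\right).$ (4.3)

Such an $f$-divergence is also known as Csiszár $f$-divergence or Ali-Silvey distance [43, 2, 38]. From Jensen’s inequality we have $d(\mathbb{P}_1, \mathbb{P}_2) \ge f(1)$, and if $f$ is strictly convex then the equality in this inequality holds only if $\mathbb{P}_1 = \mathbb{P}_2$. For example, for the strictly convex function $f(x) = x\ln(x)$ (which we assume to be zero for $x = 0$), $d(\mathbb{P}_1, \mathbb{P}_2)$ is called Kullback-Leibler divergence or cross-entropy distance (of $\mathbb{P}_2$ from $\mathbb{P}_1$), while for $f(x) = (x^2 - 1)$, $d(\mathbb{P}_1, \mathbb{P}_2)$ is called Pearson divergence. For $d$ denoting the cross-entropy distance, let us assume Condition 12, so that the zero-variance IS distribution $\mathbb{Q}^*$ exists, $\mathbb{Q}^* \ll \mathbb{Q}(b)$, $b \in A$, and

$d(\mathbb{Q}(b), \mathbb{Q}^*) = \mathbb{E}_{\mathbb{Q}(b)}\left(\frac{d\mathbb{Q}^*}{d\mathbb{Q}(b)}\ln\left(\frac{d\mathbb{Q}^*}{d\mathbb{Q}(b)}\right)\right) = \mathbb{E}_{\mathbb{Q}^*}\left(\ln\left(\frac{d\mathbb{Q}^*}{d\mathbb{Q}(b)}\right)\right) = \mathbb{E}_{\mathbb{Q}^*}\left(\ln\left(\frac{Z}{\alpha}\right) + \ln(L(b))\right),$ (4.4)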
From $x\ln(x) \ge -e^{-1}$ we have $\mathbb{E}_{\mathbb{Q}_1}(Z\ln(Z)) \ge -e^{-1}$. Assuming further that

$\mathbb{E}_{\mathbb{Q}_1}(Z\ln(Z)) < \infty,$ (4.7)

we receive from (4.4) that

$d(\mathbb{Q}(b), \mathbb{Q}^*) = \alpha^{-1}\left(\mathbb{E}_{\mathbb{Q}_1}(Z\ln(Z)) - \alpha\ln(\alpha) + \mathrm{ce}(b)\right).$ (4.8)

If (4.8) holds as above for each $b \in A$, then $b \to \mathrm{ce}(b)$ and $b \to d(\mathbb{Q}(b), \mathbb{Q}^*)$ are positively linearly equivalent (see Chapter 1). Note that from the discussion leading to formula (4.8) and from $d(\mathbb{Q}(b), \mathbb{Q}^*) \ge 0$, a sufficient assumption for (4.1) to hold is that we have $Z \ge 0$ and (4.7).
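
Identity (4.8) can be verified directly on a finite sample space. The sketch below is an illustration only: the probabilities and the positive values of `z` are made-up, `qb` stands for one member $\mathbb{Q}(b)$ of a hypothetical family, and the function names are not from this work.

```python
import math


def cross_entropy(q1, qb, z):
    """ce(b) = E_{Q1}(Z ln L(b)) with L(b) = q1/qb on a finite space."""
    return sum(p * v * math.log(p / q) for p, q, v in zip(q1, qb, z) if v != 0)


def kl_from_zero_variance(q1, qb, z):
    """d(Q(b), Q*) = E_{Q*}(ln(dQ*/dQ(b))) with dQ*/dQ1 = Z/alpha."""
    alpha = sum(p * v for p, v in zip(q1, z))
    q_star = [p * v / alpha for p, v in zip(q1, z)]
    return sum(s * math.log(s / q) for s, q in zip(q_star, qb) if s > 0)


q1 = [0.1, 0.2, 0.3, 0.4]
z = [0.5, 2.0, 1.0, 5.0]
qb = [0.25, 0.25, 0.25, 0.25]

alpha = sum(p * v for p, v in zip(q1, z))
e_z_ln_z = sum(p * v * math.log(v) for p, v in zip(q1, z))
lhs = kl_from_zero_variance(q1, qb, z)
rhs = (e_z_ln_z - alpha * math.log(alpha) + cross_entropy(q1, qb, z)) / alpha
```

The two sides `lhs` and `rhs` agree up to rounding, so on this example minimizing the cross-entropy over a family of `qb` is the same as minimizing the divergence from the zero-variance distribution.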

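We define the mean square of the IS estimator as

$\mathrm{msq}(b) = \mathbb{E}_{\mathbb{Q}(b)}((ZL(b))^2) = \mathbb{E}_{\mathbb{Q}_1}(Z^2 L(b)), \quad b \in A,$ (4.9)

and such a variance as

$\mathrm{var}(b) = \mathrm{msq}(b) - \alpha^2, \quad b \in A.$ (4.10)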

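Remark 30. Assuming that Condition 11 holds, for $\mathbb{Q}^*$ denoting the optimal-variance IS distribution as in Section 3.2 and $d$ denoting the Pearson divergence as in Remark 29, from (4.5) and (3.10) we have for $b \in A$

$d(\mathbb{Q}(b), \mathbb{Q}^*) = \mathbb{E}_{\mathbb{Q}(b)}\left(\left(\frac{|Z|}{\mathbb{E}_{\mathbb{Q}_1}(|Z|)}L(b)\right)^2 - 1\right) = \frac{\mathrm{msq}(b)}{(\mathbb{E}_{\mathbb{Q}_1}(|Z|))^2} - 1.$ (4.11)

Thus, in such a case $b \in A \to d(\mathbb{Q}(b), \mathbb{Q}^*)$ is positively linearly equivalent to msq and var.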

Let $C$ be some $[0, \infty]$-valued theoretical cost variable on $\mathcal{S}_1$. Let $c(b) = \mathbb{E}_{\mathbb{Q}(b)}(C)$ be the mean cost under $\mathbb{Q}(b)$, $b \in A$.

Condition 31. For each $b \in A$, it does not hold $c(b) = \infty$ and $\mathrm{var}(b) = 0$, or $c(b) = 0$ and $\mathrm{var}(b) = \infty$.

Assuming Condition 31, we define a (theoretical) inefficiency constant as

$\mathrm{ic}(b) = c(b)\,\mathrm{var}(b), \quad b \in A.$ (4.12)

Frequently, the proportionality constants of the practical to the theoretical costs of the IS MC as in Chapter 2 can be chosen the same for different IS parameters $b \in A$, so that the practical and theoretical inefficiency constants are proportional and their minimization is equivalent.


4.2 Estimators of the minimized functions 

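Consider a family of probability distributions as in Section 3.4 and let us assume that conditions 17 and 18 hold. Consider a measurable function $f$ on $\mathcal{S}(A)$ and for some $p \in \mathbb{N}_+$, consider

$\widehat{\mathrm{est}}_n: \mathcal{S}(A)^2 \otimes \mathcal{S}_1^n \to \mathcal{S}(\mathbb{R}), \quad n \in \mathbb{N}_p,$ (4.13)

called estimators of $f$, where $\widehat{\mathrm{est}}_n(b', b)$ is thought of as an estimator of $f(b)$ under $\mathbb{Q}(b')^n$, $b, b' \in A$, $n \in \mathbb{N}_p$. In all this work, for $b' \in A$, we denote $\mathbb{Q}' = \mathbb{Q}(b')$ and $L' = L(b')$. We say that some $\widehat{\mathrm{est}}_n$ as above is an unbiased estimator of $f$ if

$f(b) = \mathbb{E}_{(\mathbb{Q}')^n}(\widehat{\mathrm{est}}_n(b', b)), \quad b', b \in A.$ (4.14)

Let us further in this section assume the following condition.

Condition 32. We have $b' \in A$ and $\kappa_1, \kappa_2, \ldots$ are i.i.d. $\sim \mathbb{Q}'$ and $\widetilde{\kappa}_n = (\kappa_i)_{i=1}^n$, $n \in \mathbb{N}_+$.

We call the estimators $\widehat{\mathrm{est}}_n$, $n \in \mathbb{N}_p$, strongly consistent for $f$ if for each $b', b \in A$, a.s.

$\lim_{n\to\infty} \widehat{\mathrm{est}}_n(b', b)(\widetilde{\kappa}_n) = f(b).$ (4.15)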

For a function $Y$ on $\Omega_1$ (like e.g. $Z$ or $L(b)$), we define such functions $Y_1, \ldots, Y_n$ on $\Omega_1^n$ by the formula

$Y_i(\omega) = Y(\omega_i), \quad \omega = (\omega_i)_{i=1}^n \in \Omega_1^n,$ (4.16)

and whenever $Y$ takes values in some linear space we denote

$\overline{(Y)}_n = \frac{1}{n}\sum_{i=1}^n Y_i.$ (4.17)

For the cross-entropy as in the previous section, assuming (4.1), we have

$\mathrm{ce}(b) = \mathbb{E}_{\mathbb{Q}'}(L'Z\ln(L(b))),$ (4.18)

so that for $n \in \mathbb{N}_+$, from Theorem 1, its unbiased strongly consistent estimators are

$\widehat{\mathrm{ce}}_n(b', b) = \overline{(L'Z\ln(L(b)))}_n = \frac{1}{n}\sum_{i=1}^n L'_i Z_i \ln(L_i(b)).$ (4.19)

For mean square, we have

$\mathrm{msq}(b) = \mathbb{E}_{\mathbb{Q}'}(Z^2 L' L(b)),$ (4.20)

so that for $n \in \mathbb{N}_+$, its unbiased strongly consistent estimators are

$\widehat{\mathrm{msq}}_n(b', b) = \overline{(Z^2 L' L(b))}_n.$ (4.21)
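
The estimators (4.19) and (4.21) can be sketched for a concrete family. The Gaussian mean-shift family $\mathbb{Q}(b) = \mathcal{N}(b, 1)$ with $\mathbb{Q}_1 = \mathcal{N}(0,1)$, for which $\ln L(b)(x) = -bx + b^2/2$, and the function names below are assumptions made only for this illustration.

```python
import math


def log_l(b, x):
    """ln L(b)(x) for the Gaussian mean-shift family Q(b) = N(b, 1), Q_1 = N(0, 1)."""
    return -b * x + 0.5 * b * b


def ce_hat(b_prime, b, xs, z):
    """ce_n(b', b): sample mean of L'_i Z_i ln L_i(b), with xs drawn under Q(b')."""
    return sum(math.exp(log_l(b_prime, x)) * z(x) * log_l(b, x) for x in xs) / len(xs)


def msq_hat(b_prime, b, xs, z):
    """msq_n(b', b): sample mean of Z_i^2 L'_i L_i(b), with xs drawn under Q(b')."""
    return sum(
        z(x) ** 2 * math.exp(log_l(b_prime, x) + log_l(b, x)) for x in xs
    ) / len(xs)


def indicator_above_3(x):
    """Example functional Z = 1(X > 3)."""
    return 1.0 if x > 3.0 else 0.0
```

Both functions take the sample `xs` generated under $\mathbb{Q}(b')$ as fixed and can then be minimized over `b`, which is the single-stage use of these estimators described in the abstract.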


The above mean square estimators and estimators negatively linearly equivalent to the above 
cross-entropy estimators in the function of b (see Chapter 1) have been considered before 
in the literature; see e.g. [46, 47, 48, 30]. Thus, we call the above estimators well-known. We 
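shall now proceed to define some new estimators. If $\mathbb{Q}(b) \ll \mathbb{Q}'$, then for variance, we have for $n \in \mathbb{N}_2$

$\mathrm{var}(b) = \mathrm{Var}_{\mathbb{Q}(b)}(ZL(b)) = \frac{1}{n(n-1)}\mathbb{E}_{\mathbb{Q}(b)^n}\Big(\sum_{i<j\in\{1,\ldots,n\}}(Z_iL_i(b) - Z_jL_j(b))^2\Big) = \frac{1}{n(n-1)}\mathbb{E}_{(\mathbb{Q}')^n}\Big(\sum_{i<j\in\{1,\ldots,n\}}\Big(\frac{d\mathbb{Q}(b)}{d\mathbb{Q}'}\Big)_i\Big(\frac{d\mathbb{Q}(b)}{d\mathbb{Q}'}\Big)_j(Z_iL_i(b) - Z_jL_j(b))^2\Big).$ (4.22)

Let us further in this section assume conditions 22 and 23. Then, $\frac{d\mathbb{Q}(b)}{d\mathbb{Q}'} = \frac{L'}{L(b)}$, and from (4.22), we have the following unbiased estimators of var for $n \in \mathbb{N}_2$

$\widehat{\mathrm{var}}_n(b', b) = \frac{1}{n(n-1)}\sum_{i<j\in\{1,\ldots,n\}}\frac{L'_iL'_j}{L_i(b)L_j(b)}(Z_iL_i(b) - Z_jL_j(b))^2 = \frac{1}{n(n-1)}\Big(\sum_{i=1}^n Z_i^2L'_iL_i(b)\sum_{j\in\{1,\ldots,n\},\,j\neq i}\frac{L'_j}{L_j(b)} - \sum_{i<j\in\{1,\ldots,n\}}2Z_iZ_jL'_iL'_j\Big) = \frac{n}{n-1}\Big(\widehat{\mathrm{msq}}_n(b', b)\,\overline{\Big(\frac{L'}{L(b)}\Big)}_n - \big(\overline{(ZL')}_n\big)^2\Big).$ (4.23)

Thus, $b \to \widehat{\mathrm{var}}_n(b', b)$ is positively linearly equivalent to the following estimator of mean square

$\widehat{\mathrm{msq2}}_n(b', b) = \widehat{\mathrm{msq}}_n(b', b)\,\overline{\Big(\frac{L'}{L(b)}\Big)}_n,$ (4.24)

which can be considered also for $n = 1$. From the facts that from the SLLN, a.s.

$\lim_{n\to\infty}\overline{(ZL')}_n(\widetilde{\kappa}_n) = \mathbb{E}_{\mathbb{Q}'}(ZL') = \alpha$ (4.25)

and

$\lim_{n\to\infty}\overline{\Big(\frac{L'}{L(b)}\Big)}_n(\widetilde{\kappa}_n) = \mathbb{E}_{\mathbb{Q}'}\Big(\frac{L'}{L(b)}\Big) = 1,$ (4.26)

estimators $\widehat{\mathrm{msq2}}_n$ and $\widehat{\mathrm{var}}_n$ are strongly consistent for msq and var respectively. Let us further in this section assume that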

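$\mathbb{Q}(b)(C = \infty) = 0, \quad b \in A.$ (4.27)

Then, strongly consistent and unbiased estimators of the mean cost $c$ are

$\widehat{c}_n(b', b) = \overline{\Big(\frac{L'}{L(b)}\mathbb{1}(C \neq \infty)C\Big)}_n.$ (4.28)

Let us further in this section assume Condition 31. Then, strongly consistent estimators of ic are for $n \in \mathbb{N}_2$,

$\widehat{\mathrm{ic}}_n(b', b) = \widehat{c}_n(b', b)\,\widehat{\mathrm{var}}_n(b', b),$ (4.29)

which are in general not unbiased. For each $n \in \mathbb{N}_3$, defining helper unbiased estimators of variance for $k = 1, \ldots, n$

$\widehat{\mathrm{var}}_{n,k}(b', b) = \frac{1}{(n-1)(n-2)}\sum_{i<j\in\{1,\ldots,n\}\setminus\{k\}}\Big(\frac{L'}{L(b)}\Big)_i\Big(\frac{L'}{L(b)}\Big)_j(Z_iL_i(b) - Z_jL_j(b))^2,$ (4.30)

we have the following unbiased estimator of ic

$\widehat{\mathrm{ic2}}_n(b', b) = \frac{1}{n}\sum_{k=1}^n\Big(\frac{L'}{L(b)}\mathbb{1}(C \neq \infty)C\Big)_k\widehat{\mathrm{var}}_{n,k}(b', b).$ (4.31)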
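
The new estimators of this section can be sketched as plain functions of the sampled values. This is a minimal illustration, not the implementation used in the numerical experiments: the inputs are lists of $Z_i$, $L'_i$, $L_i(b)$ and $C_i$ evaluated on a sample drawn under $\mathbb{Q}'$, and all function names are hypothetical.

```python
import math


def pairwise_var(z, l_prime, l_b, skip=None):
    """Pairwise-sum variance estimator over indices other than `skip`.

    With skip=None this is var_n in its pairwise form; with skip=k it is the
    helper estimator var_{n,k}.
    """
    idx = [i for i in range(len(z)) if i != skip]
    m = len(idx)
    total = 0.0
    for a in range(m):
        for c in range(a + 1, m):
            i, j = idx[a], idx[c]
            w = (l_prime[i] / l_b[i]) * (l_prime[j] / l_b[j])
            total += w * (z[i] * l_b[i] - z[j] * l_b[j]) ** 2
    return total / (m * (m - 1))


def msq2_hat(z, l_prime, l_b):
    """msq2_n = msq_n times the sample mean of L'/L(b)."""
    n = len(z)
    msq = sum(zi * zi * lp * lb for zi, lp, lb in zip(z, l_prime, l_b)) / n
    ratio = sum(lp / lb for lp, lb in zip(l_prime, l_b)) / n
    return msq * ratio


def var_hat_closed_form(z, l_prime, l_b):
    """var_n written as n/(n-1) * (msq2_n - (mean of Z L')^2)."""
    n = len(z)
    mean_zl = sum(zi * lp for zi, lp in zip(z, l_prime)) / n
    return n / (n - 1) * (msq2_hat(z, l_prime, l_b) - mean_zl ** 2)


def ic_hat(z, l_prime, l_b, cost):
    """ic_n = c_n * var_n (strongly consistent, in general biased)."""
    n = len(z)
    c_n = sum(
        lp / lb * c for lp, lb, c in zip(l_prime, l_b, cost) if math.isfinite(c)
    ) / n
    return c_n * pairwise_var(z, l_prime, l_b)


def ic2_hat(z, l_prime, l_b, cost):
    """ic2_n: each cost term is paired with a variance estimate that omits its own index."""
    n = len(z)
    total = 0.0
    for k in range(n):
        if math.isfinite(cost[k]):
            total += l_prime[k] / l_b[k] * cost[k] * pairwise_var(z, l_prime, l_b, skip=k)
    return total / n
```

The pairwise form costs $O(n^2)$ operations, while the closed form needs only $O(n)$; the sketch keeps both so that their agreement can be checked on small samples.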
5 Examples of parametrizations of IS

In this chapter we introduce a number of parametrizations of IS, most of which shall be used in the theoretical reasonings or numerical experiments in this work.

5.1 Exponential change of measure 

Exponential change of measure (ECM), also known as exponential tilting, is a popular method 
for obtaining a family of IS distributions from a given one. It has found numerous applications 
among others in IS for rare event simulation [10, 5] or for pricing derivatives in computational 
finance [30, 37]. In this work by default all vectors (including gradients of functions) are 
considered to be column vectors. For some $l \in \mathbb{N}_+$, consider an $\mathbb{R}^l$-valued random vector $X$ on $\mathcal{S}_1$. We define the moment-generating function as $b \in \mathbb{R}^l \to \Phi(b) = \mathbb{E}_{\mathbb{Q}_1}(\exp(b^T X))$.

Let $A$ be the set of all $b \in \mathbb{R}^l$ for which $\Phi(b) < \infty$. Note that $0 \in A$ and from the convexity of the exponential function, $A$ is convex. The cumulant generating function is defined as $\Psi(b) = \ln(\Phi(b))$, $b \in A$.

Condition 33. For each $b_1, b_2 \in A$ such that $b_1 \neq b_2$, $(b_1 - b_2)^T X$ is not $\mathbb{Q}_1$ a.s. constant.

Lemma 34. $\Psi$ is convex on $A$ and it is strictly convex on $A$ only if Condition 33 holds.

Proof. Let $b_1, b_2 \in A$ and $q_1, q_2 \in \mathbb{R}_+$ be such that $q_1 + q_2 = 1$. From Hölder’s inequality

$\Phi\Big(\sum_{i=1}^2 q_i b_i\Big) \le \prod_{i=1}^2 \Phi(b_i)^{q_i}$ (5.1)

and taking the logarithms of both sides we receive

$\Psi\Big(\sum_{i=1}^2 q_i b_i\Big) \le \sum_{i=1}^2 q_i \Psi(b_i).$ (5.2)

Thus, $\Psi$ is convex. Equality in (5.1) or equivalently in (5.2) holds only if for some $a \in \mathbb{R}_+$, $\mathbb{Q}_1$ a.s. $\exp(b_1^T X) = a\exp(b_2^T X)$ (see page 63 in [50]) or equivalently if for some $c \in \mathbb{R}$, $\mathbb{Q}_1$ a.s. $(b_1 - b_2)^T X = c$. $\Psi$ is strictly convex only if there do not exist $b_1, b_2 \in A$, $b_1 \neq b_2$, such that an equality in (5.2) holds, and thus only if Condition 33 holds. □

Condition 35. For each $t \in \mathbb{R}^l \setminus \{0\}$, $t^T X$ is not $\mathbb{Q}_1$ a.s. constant.
Note that Condition 35 implies Condition 33 and if $A$ has a nonempty interior then these conditions are equivalent. If $A$ contains some neighbourhood of zero, then all mixed moments of $X$ are finite, i.e. $\mathbb{E}_{\mathbb{Q}_1}\big(\prod_{i=1}^l |X_i|^{v_i}\big) < \infty$, $v \in \mathbb{N}^l$. For $v \in \mathbb{N}^l$ let us denote $\partial_v = \frac{\partial^{v_1 + \ldots + v_l}}{\partial b_1^{v_1} \ldots \partial b_l^{v_l}}$.

Condition 36. $A$ is open, $\Phi$ is smooth (i.e. infinitely continuously differentiable) on $A$, and for each $v \in \mathbb{N}^l$ we have

$\partial_v \Phi(b) = \mathbb{E}_{\mathbb{Q}_1}(\partial_v \exp(b^T X)) = \mathbb{E}_{\mathbb{Q}_1}\Big(\exp(b^T X)\prod_{i=1}^l X_i^{v_i}\Big), \quad b \in A.$ (5.3)

Remark 37. It is easy to show using inductively the mean value theorem and Lebesgue’s dominated convergence theorem that Condition 36 holds when $A = \mathbb{R}^l$ or when $\mathbb{Q}_1(X \in [0,\infty)^l) = 1$ and for some $\lambda > 0$, $A = (-\infty, \lambda)^l$.

We define the exponentially tilted family of probability distributions $\mathbb{Q}(b)$, $b \in A$, corresponding to the above $\mathbb{Q}_1$ and $X$ by the formula

$\frac{d\mathbb{Q}(b)}{d\mathbb{Q}_1} = \exp(b^T X - \Psi(b)), \quad b \in A.$ (5.4)

Note that $\mathbb{Q}(0) = \mathbb{Q}_1$ and

$L(b) := \exp(-b^T X + \Psi(b)) = \frac{d\mathbb{Q}_1}{d\mathbb{Q}(b)}, \quad b \in A.$ (5.5)

Note that conditions 18, 22, and 23 hold for the above $\mathbb{Q}(b)$ and $L(b)$, $b \in A$. From Lemma 34, for each $\omega \in \Omega_1$, $b \in A \to L(b)(\omega)$ is log-convex (and thus also convex) and if Condition 33 holds, then it is strictly log-convex (and thus also strictly convex). Let us define means $\mu(b) = \mathbb{E}_{\mathbb{Q}(b)}(X)$ and covariance matrices $\Sigma(b) = \mathbb{E}_{\mathbb{Q}(b)}((X - \mu(b))(X - \mu(b))^T)$, for $b \in A$ for which they exist. Note that the functions $\Phi$, $\Psi$, $\Sigma$, and $\mu$ depend only on the law of $X$ under $\mathbb{Q}_1$. If for some $b \in A$ it holds $\Sigma(b) \in \mathbb{R}^{l \times l}$, then we have $t^T\Sigma(b)t = \mathbb{E}_{\mathbb{Q}(b)}((t^T(X - \mu(b)))^2)$, $t \in \mathbb{R}^l$, and thus $\Sigma(b)$ is positive definite only if Condition 35 holds. When Condition 36 holds, then we receive by direct calculation that $\nabla\Psi(b) = \mu(b)$ and $\nabla^2\Psi(b) = \Sigma(b)$, $b \in A$.

Let $U$ be an open subset of $\mathbb{R}^l$. The following well-known lemma is an easy consequence of the inverse function theorem.

Lemma 38. If $f: U \to \mathbb{R}^l$ is injective and differentiable with an invertible derivative $Df$ on $U$, then $f$ is a diffeomorphism of the open sets $U$ and $f[U]$.

By $|\cdot|$ we denote the standard Euclidean norm.

Lemma 39. If $U$ is convex and a function $g: U \to \mathbb{R}$ is strictly convex and differentiable, then the function $b \in U \to \nabla g(b)$ is injective.

Proof. If for some $b_1, b_2 \in U$, $b_1 \neq b_2$, we had $\nabla g(b_1) = \nabla g(b_2)$, then for $v = \frac{b_2 - b_1}{|b_2 - b_1|}$ it would hold

$\frac{dg(b_1 + tv)}{dt}\Big|_{t=0} = v^T\nabla g(b_1) = v^T\nabla g(b_2) = \frac{dg(b_1 + tv)}{dt}\Big|_{t=|b_2 - b_1|},$ (5.6)

which is impossible since $t \in [0, |b_2 - b_1|] \to g(b_1 + tv)$ is strictly convex. □

Theorem 40. If conditions 35 and 36 hold, then $b \in A \to \mu(b) = \nabla\Psi(b)$ is a diffeomorphism of the open sets $A$ and $\mu[A]$.

Proof. From Condition 35 and Lemma 34, $\Psi$ is strictly convex. From Condition 36, $D\mu = \nabla^2\Psi = \Sigma$, which from Condition 35 and the above discussion is positive definite. Thus, for $U = A$ the thesis follows from Lemma 39 for $g = \Psi$ and Lemma 38 for $f = \mu$. □

Some important special cases of ECM for $l = 1$ are when $X$ has a binomial, Poisson, or gamma distribution under $\mathbb{Q}(b)$, $b \in A$, while for general $l \in \mathbb{N}_+$ an important case is when $X$ has a multivariate normal distribution (see page 130 in [5]). In all these cases, from Remark 37, Condition 36 holds. Furthermore, for the first three cases and non-degenerate multivariate normal distributions, Condition 35 is satisfied and we have analytical formulas for $\mu^{-1}$. In the gamma case, for some $\alpha, \lambda \in \mathbb{R}_+$, and $A = (-\infty, \lambda)$, for each $b \in A$, for $\lambda_b = \lambda - b$, under $\mathbb{Q}(b)$, $X$ has a distribution with a density

$\frac{1}{\Gamma(\alpha)}\lambda_b^\alpha x^{\alpha-1}\exp(-\lambda_b x)$ (5.7)

with respect to the Lebesgue measure on $(0,\infty)$. Furthermore, for each $b \in A$ it holds $\Psi(b) = \alpha\ln\big(\frac{\lambda}{\lambda - b}\big)$ and $\mu(b) = \frac{\alpha}{\lambda - b}$, and for each $x \in \mu[A] = \mathbb{R}_+$, $\mu^{-1}(x) = \lambda - \frac{\alpha}{x}$. In the Poisson case we have $A = \mathbb{R}$ and for some initial mean $\mu_0 \in \mathbb{R}_+$, for each $b \in A$ we have $\mu(b) = \mu_0\exp(b)$ and

$\mathbb{Q}(b)(X = k) = \frac{\mu(b)^k}{k!}\exp(-\mu(b)), \quad k \in \mathbb{N},$ (5.8)

i.e. $X \sim \mathrm{Pois}(\mu(b))$ under $\mathbb{Q}(b)$. Furthermore, it holds $\Psi(b) = \mu_0(\exp(b) - 1)$, $b \in A$, $\mu^{-1}(x) = \ln\big(\frac{x}{\mu_0}\big)$, $x \in \mu[A] = (0,\infty)$, and $\Sigma(b) = \mu(b)$, $b \in A$. In the multivariate normal case we have $A = \mathbb{R}^l$ and for $M \in \mathbb{R}^{l \times l}$ being some positive semidefinite covariance matrix and $\mu_0 \in \mathbb{R}^l$ some initial mean, for each $b \in A$, $\mu(b) = \mu_0 + Mb$ and under $\mathbb{Q}(b)$, $X \sim \mathcal{N}(\mu(b), M)$. Moreover, it holds $\Psi(b) = b^T\mu_0 + \frac{1}{2}b^TMb$ and $\Sigma(b) = M$, $b \in A$. An important special case are non-degenerate normal distributions in which $M$ is positive definite, $\mu[A] = A$, and $\mu^{-1}(x) = M^{-1}(x - \mu_0)$, $x \in A$. In the standard multivariate normal case we have $M = I_l$ and $\mu_0 = 0$, so that $X \sim \mathcal{N}(b, I_l)$ under $\mathbb{Q}(b)$, $b \in A$.
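
The closed-form expressions for the gamma and Poisson cases can be checked numerically against the relation $\nabla\Psi = \mu$. The sketch below is only an illustration: the function names are hypothetical, the parameter values in the usage lines are arbitrary, and the finite-difference step is chosen for convenience.

```python
import math


# Gamma case: shape alpha, rate lam, A = (-inf, lam).
def psi_gamma(b, alpha, lam):
    return alpha * math.log(lam / (lam - b))


def mu_gamma(b, alpha, lam):
    return alpha / (lam - b)


def mu_inv_gamma(x, alpha, lam):
    return lam - alpha / x


# Poisson case: initial mean mu0, A = R.
def psi_poisson(b, mu0):
    return mu0 * (math.exp(b) - 1.0)


def mu_poisson(b, mu0):
    return mu0 * math.exp(b)


def mu_inv_poisson(x, mu0):
    return math.log(x / mu0)


def derivative(f, b, h=1e-6):
    """Central finite difference, used only to check that mu is the derivative of Psi."""
    return (f(b + h) - f(b - h)) / (2.0 * h)


# Arbitrary illustrative parameters.
gamma_check = derivative(lambda b: psi_gamma(b, 2.0, 3.0), 0.5) - mu_gamma(0.5, 2.0, 3.0)
poisson_check = derivative(lambda b: psi_poisson(b, 1.5), -0.2) - mu_poisson(-0.2, 1.5)
```

Both differences are zero up to the finite-difference error, and composing each `mu_inv_*` with the corresponding `mu_*` returns the original parameter, consistent with Theorem 40.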

For an exponential tilting in which $A = \mathbb{R}^l$ we shall further need the following function defined for $a \in [0,\infty)$

$F(a) = \sup\{|\Psi(b)| : b \in \mathbb{R}^l,\ |b| \le a\}.$ (5.9)

For instance, in the multivariate standard normal case as above we have $\Psi(b) = \frac{|b|^2}{2}$ and thus $F(a) = \frac{a^2}{2}$, while in the Poisson case $F(a) = \mu_0(\exp(a) - 1)$.

Remark 41. In some practical realizations of ECM, the computation times on a computer needed to generate i.i.d. replicates of the IS estimator $ZL(b)$ under $\mathbb{Q}(b)$ for different $b \in A$ are approximately equal to the same constant. This is typically the case e.g. when $X \sim \mathcal{N}(0, I_l)$ under $\mathbb{Q}_1$. In such a case one can often take the theoretical cost $C = 1$.
5.2 IS for independently parametrized product distributions

Let $n \in \mathbb{N}_+$. For each $i \in \{1, \ldots, n\}$, consider a probability distribution $\mathbb{Q}_{1,i}$ on a measurable space $\mathcal{S}_{1,i} = (\Omega_{1,i}, \mathcal{F}_{1,i})$, a nonempty set $A_i$, and parametric families of probabilities $\mathbb{Q}_i(b_i)$ and densities $L_i(b_i) = \frac{d\mathbb{Q}_{1,i}}{d\mathbb{Q}_i(b_i)}$, $b_i \in A_i$. Let us define the corresponding product measure $\mathbb{Q}_1 = \bigotimes_{i=1}^n \mathbb{Q}_{1,i}$, product parameter set $A = \prod_{i=1}^n A_i$, and families of independently parametrized product probabilities $\mathbb{Q}(b) = \bigotimes_{i=1}^n \mathbb{Q}_i(b_i)$ and densities $L(b)(\omega) = \prod_{i=1}^n L_i(b_i)(\omega_i)$, $b = (b_i)_{i=1}^n \in A$. Then, $\mathbb{Q}(b) \sim \mathbb{Q}_1$ and $L(b) = \frac{d\mathbb{Q}_1}{d\mathbb{Q}(b)}$, $b \in A$.

Let us further consider the special case of $\mathbb{Q}_i$ and $L_i$ as above being the exponentially tilted probabilities and densities given by some probabilities $\mathbb{Q}_{1,i}$ and random variables $X_i$, having moment-generating functions $\Phi_i$ and cumulant generating functions $\Psi_i$, $i = 1, \ldots, n$. Then, $\mathbb{Q}(b)$ and $L(b)$, $b \in A$, are the exponentially tilted probabilities and densities corresponding to the above probability $\mathbb{Q}_1$ and a random variable $X(\omega) = (X_i(\omega_i))_{i=1}^n$, $\omega \in \prod_{i=1}^n \Omega_{1,i}$, with a moment-generating function $\Phi(b) = \prod_{i=1}^n \Phi_i(b_i)$ and a cumulant generating function $\Psi(b) = \sum_{i=1}^n \Psi_i(b_i)$. If Condition 35 or 36 holds in the $i$th case for $i \in \{1, \ldots, n\}$, then such a condition holds also in the product case. If $\mu_i$ is the mean function in the $i$th case, $i = 1, \ldots, n$, then $\mu(b) = (\mu_i(b_i))_{i=1}^n$, $b = (b_i)_{i=1}^n \in A$, is such a mean function in the product case, and if all $\mu_i^{-1}$ exist, then for each $x = (x_i)_{i=1}^n \in \mu[A] = \prod_{i=1}^n \mu_i[A_i]$, $\mu^{-1}(x) = (\mu_i^{-1}(x_i))_{i=1}^n$.

5.3 IS for stopped sequences 

5.3.1 Change of measure for stopped sequences using a tilting process 

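Let $\mathbb{U}_1$ be a probability measure on a measurable space $\mathcal{C}_0 = (E_0, \mathcal{E}_0)$, let $\mathcal{C} = (E, \mathcal{E}) := \mathcal{C}_0^{\mathbb{N}_+}$, let $\eta = (\eta_i)_{i\in\mathbb{N}_+} = \mathrm{id}_E$ be the coordinate process on $E$, and let $\widetilde{\eta}_k = (\eta_i)_{i=1}^k$, $k \in \mathbb{N}_+$. Let $\mathbb{U}$ be the unique probability measure on $\mathcal{C}$ such that $\eta_1, \eta_2, \ldots$ are i.i.d. $\sim \mathbb{U}_1$ under $\mathbb{U}$ (see Theorem 16, Chapter 9 in [18]). Let $\mathcal{F}_k = \sigma(\widetilde{\eta}_k)$, $k \in \mathbb{N}_+$, i.e. it is the natural filtration of $\eta$, and let $\mathcal{F}_0 = \{\emptyset, E\}$, i.e. it is a trivial $\sigma$-field. For some $d \in \mathbb{N}_+$ and a nonempty set $B \in \mathcal{B}(\mathbb{R}^d)$, let conditions 18, 22, and 23 hold for $A = B$, $\mathbb{Q}_1 = \mathbb{U}_1$, and some probabilities $\mathbb{Q}(b)$ and densities $L(b)$ denoted further as $\mathbb{U}(b)$ and $L(b)$, $b \in B$. Let $\kappa(b) = L(b)^{-1} = \frac{d\mathbb{U}(b)}{d\mathbb{U}_1}$, $b \in B$.

Definition 42. We define $\mathcal{J}$ to be the set of all $\mathcal{S}(B)$-valued, $(\mathcal{F}_k)_{k\in\mathbb{N}}$-adapted stochastic processes $\lambda = (\lambda_k)_{k\in\mathbb{N}}$ on $\mathcal{C}$.

Processes $\lambda$ as in the above definition shall be called tilting processes. The following lemma follows from Lemma 7, Chapter 21 in [18]. See Definition 18, Chapter 21 in [18] for the definition of Borel spaces. From Proposition 20 in that Chapter, $\mathcal{S}(B)$ is a Borel space.

Lemma 43. Let $\mathcal{P}$ be a measurable space, $\mathcal{T}$ be a Borel space, $V$ be a $\mathcal{P}$-valued random variable, and $Y$ be a $\mathcal{T}$-valued, $\sigma(V)$-measurable random variable. Then, there exists a measurable function $f: \mathcal{P} \to \mathcal{T}$ such that $Y = f(V)$.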

Let further in this section $\lambda$ be as in Definition 42. From the above lemma there exist $b_0 \in B$ and $h_k: \mathcal{C}_0^k \to \mathcal{S}(B)$, $k \in \mathbb{N}_+$, such that $\lambda_0 = b_0$ and $\lambda_k = h_k(\widetilde{\eta}_k)$, $k \in \mathbb{N}_+$, which let us further consider. Let $\gamma_0 = 1$ and

$\gamma_n = \prod_{k=0}^{n-1} \kappa(\lambda_k)(\eta_{k+1}), \quad n \in \mathbb{N}_+.$ (5.10)

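For a nonempty set $T \subset [0,\infty)$ and a filtration $(\mathcal{G}_t)_{t\in T}$ on a measurable space $(\Omega, \mathcal{G})$, let $\mathcal{G}_\infty := \sigma\big(\bigcup_{t\in T}\mathcal{G}_t\big)$. A stopping time $\tau$ for $(\mathcal{G}_t)_{t\in T}$ is a $T \cup \{\infty\}$-valued random variable such that $\{\tau \le t\} \in \mathcal{G}_t$, $t \in T$. For such a $\tau$ one defines a $\sigma$-field

$\mathcal{G}_\tau = \{A \in \mathcal{G}_\infty : A \cap \{\tau \le t\} \in \mathcal{G}_t,\ t \in T\}.$ (5.11)

For $\tau$ being a stopping time for the filtration $(\mathcal{F}_k)_{k\in\mathbb{N}}$ as above it also holds

$\mathcal{F}_\tau = \{B \in \mathcal{E} : B \cap \{\tau = n\} \in \mathcal{F}_n,\ n \in \mathbb{N}\}.$ (5.12)

For a probability $\mathbb{S}$ on $\mathcal{C}$ and such a $\tau$ we shall denote $\mathbb{S}|_\tau = \mathbb{S}|_{\mathcal{F}_\tau}$. Identifying each $n \in \mathbb{N}$ with a constant random variable we thus have $\mathbb{S}|_n = \mathbb{S}|_{\mathcal{F}_n}$. The following theorem is an easy consequence of Theorem 3, Chapter 22 in [18].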

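Theorem 44. There exists a unique probability $\mathbb{V}$ on $\mathcal{C}$ satisfying one of the following equivalent conditions.

1. Under $\mathbb{V}$, $\eta_1$ has density $\kappa(\lambda_0)$ with respect to $\mathbb{U}_1$ and for each $k \in \mathbb{N}_+$, $\eta_{k+1}$ has conditional density $\kappa(\lambda_k)$ with respect to $\mathbb{U}_1$ given $\mathcal{F}_k$ (see Definition 14, Chapter 21 in [18]).

2. For each $n \in \mathbb{N}$,

$\frac{d\mathbb{V}|_n}{d\mathbb{U}|_n} = \gamma_n.$ (5.13)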


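Let $\mathbb{V}$ be as in the above theorem and let $\tau$ be a stopping time for $(\mathcal{F}_n)_{n\in\mathbb{N}}$.

Lemma 45. It holds

$\mathbb{V}|_\tau \sim_{\tau<\infty} \mathbb{U}|_\tau,$ (5.14)

with

$\mathbb{1}(\tau < \infty)\gamma_\tau = \left(\frac{d\mathbb{V}|_\tau}{d\mathbb{U}|_\tau}\right)_{\tau<\infty}.$ (5.15)

Proof. To prove (5.15) we notice that for each $B \in \mathcal{F}_\tau$ we have

$\mathbb{E}_{\mathbb{U}}(\mathbb{1}(B \cap \{\tau < \infty\})\gamma_\tau) = \sum_{n=0}^\infty \mathbb{E}_{\mathbb{U}}(\mathbb{1}(\{\tau = n\} \cap B)\gamma_n) = \sum_{n=0}^\infty \mathbb{V}(\{\tau = n\} \cap B) = \mathbb{V}(B \cap \{\tau < \infty\}),$ (5.16)

where in the second equality we used (5.12) and (5.13). Now, (5.14) follows from (5.15) and Lemma 8. □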


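From the above lemma, if $\tau < \infty$ both $\mathbb{V}$ and $\mathbb{U}$ a.s., then $\mathbb{V}|_\tau \sim \mathbb{U}|_\tau$. In this work, a product over an empty set is considered to be equal to one. For some $\epsilon \in \mathbb{R}_+$, considered to avoid some technical problems as discussed above Condition 23, let us define

$L = \mathbb{1}(\tau < \infty)\frac{1}{\gamma_\tau} + \epsilon\mathbb{1}(\tau = \infty) = \mathbb{1}(\tau < \infty)\prod_{k=0}^{\tau-1} L(\lambda_k)(\eta_{k+1}) + \epsilon\mathbb{1}(\tau = \infty).$ (5.17)

Then, from Lemma 45 and the discussion in Section 3.1 it holds

$L = \left(\frac{d\mathbb{U}|_\tau}{d\mathbb{V}|_\tau}\right)_{\tau<\infty}.$ (5.18)

Let $Z$ be an $\mathbb{R}$-valued, $\mathcal{F}_\tau$-measurable random variable such that $\mathbb{E}_{\mathbb{U}}(|Z|) < \infty$ (for short we shall also informally describe such a $Z$ as an $\mathbb{R}$-valued element of $L^1(\mathbb{U}|_\tau)$, see e.g. Chapter 20 in [18]). Let us assume that $\mathbb{U}(Z \neq 0, \tau = \infty) = \mathbb{V}(Z \neq 0, \tau = \infty) = 0$, so that from (5.14), $\mathbb{V}|_\tau \sim_{Z \neq 0 \vee \tau<\infty} \mathbb{U}|_\tau$ and

$L = \left(\frac{d\mathbb{U}|_\tau}{d\mathbb{V}|_\tau}\right)_{Z \neq 0 \vee \tau<\infty}.$ (5.19)

Then, one can perform IS as in Section 3.2 for $\mathbb{Q}_1 = \mathbb{U}|_\tau$, $\mathbb{Q}_2 = \mathbb{V}|_\tau$, and $L$ as above. Note that such a $\mathbb{Q}_1$ is defined on $\mathcal{S}_1 = (\Omega_1, \mathcal{F}_1) = (E, \mathcal{F}_\tau)$.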


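Remark 46. Consider two stopping times $\tau_1, \tau_2$ for $(\mathcal{F}_n)_{n \in \mathbb{N}}$ such that $\tau_1 \le \tau_2$ and an $\mathbb{R}$-valued
$Z \in L^1(\mathbb{U}|_{\tau_1})$ such that $\mathbb{U}(Z \neq 0, \tau_2 = \infty) = \mathbb{V}(Z \neq 0, \tau_2 = \infty) = 0$. Then, we also have $Z \in L^1(\mathbb{U}|_{\tau_2})$
and $\mathbb{U}(Z \neq 0, \tau_1 = \infty) = \mathbb{V}(Z \neq 0, \tau_1 = \infty) = 0$. Furthermore, denoting $L$ as in (5.17) for $\tau = \tau_i$ as
$L_{\tau_i}$, we have

$$E_\mathbb{V}(Z L_{\tau_2} \mid \mathcal{F}_{\tau_1}) = Z L_{\tau_1}. \tag{5.20}$$

Indeed, for each $D \in \mathcal{F}_{\tau_1}$ and $i = 1, 2$, from (5.19) it holds

$$E_\mathbb{V}(Z L_{\tau_i} \mathbb{1}(D)) = E_\mathbb{V}(Z L_{\tau_i} \mathbb{1}(D \cap \{Z \neq 0\})) = E_\mathbb{U}(Z \mathbb{1}(D \cap \{Z \neq 0\})). \tag{5.21}$$

From (5.20) and the conditional Jensen's inequality we have $\mathrm{Var}_\mathbb{V}(Z L_{\tau_2}) \ge \mathrm{Var}_\mathbb{V}(Z L_{\tau_1})$, i.e. using $\tau_1$
for IS as above leads to not higher variance than using $\tau_2$. Furthermore, $E_\mathbb{V}(\tau_1) \le E_\mathbb{V}(\tau_2)$, so that,
for the theoretical costs equal to the respective stopping times, using $\tau_1$ also leads to not higher
mean cost and inefficiency constant than $\tau_2$ (assuming that such constants are well-defined).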

5.3.2 Parametrizations of IS for stopped sequences 

For some $l \in \mathbb{N}_+$ and a nonempty set $A \in \mathcal{B}(\mathbb{R}^l)$, let us consider a function

(5.22)
(see Definition 42), called a parametrization of tilting processes. For each $b \in A$, let $\mathbb{V}(b)$ and
$\mathbb{V}|_\tau(b)$ be given by $\Lambda(b)$ similarly as $\mathbb{V}$ and $\mathbb{V}|_\tau$ are given by $\Lambda$ in the unparametrized case in the
previous section. Let $\mathbb{Q}_1$ and $\mathcal{S}_1$ be as in the previous section. Let for each $b \in A$,
$\mathbb{Q}(b) = \mathbb{V}|_\tau(b)$ and $L(b)$ be defined by formula (5.17) but using $\Lambda(b)$ in the place of $\Lambda$. Note that
such an $L$ satisfies Condition 23.

Condition 47. For each $n \in \mathbb{N}$, $(b, x) \to \Lambda_n(b)(x)$ is measurable from $\mathcal{S}(A) \otimes (E, \mathcal{F}_n)$ to $\mathcal{S}(B)$.

Theorem 48. Under Condition 47, Condition 18 holds for the above $L$.

To prove the above theorem we will need the following lemmas. 

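Lemma 49. Let $\mathcal{G}$ be a $\sigma$-field, $A \in \mathcal{G}$, for some set $T$, $C_t \in \mathcal{G}$, $t \in T$, and $\mathcal{C} = \sigma(C_t : t \in T)$. Then

$$A \cap \mathcal{C} := \{A \cap C \mid C \in \mathcal{C}\} \subset \sigma(A, \{A \cap C_t : t \in T\}). \tag{5.23}$$

Proof. Let $\mathcal{A} = \sigma(A, \{A \cap C_t : t \in T\})$ and $\mathcal{D} = \{C \in \mathcal{C} : A \cap C \in \mathcal{A}\}$. It holds $\emptyset \in \mathcal{D}$ and $C_t \in \mathcal{D}$,
$t \in T$. If $B_i \in \mathcal{D}$, $i \in \mathbb{N}$, then $A \cap \bigcup_{i \in \mathbb{N}} B_i = \bigcup_{i \in \mathbb{N}} A \cap B_i \in \mathcal{A}$ and thus $\bigcup_{i \in \mathbb{N}} B_i \in \mathcal{D}$. If $B \in \mathcal{D}$, then
$A \cap B' = A \setminus (A \cap B) \in \mathcal{A}$ and thus $B' \in \mathcal{D}$. Thus, $\mathcal{D} = \mathcal{C}$ and we have (5.23). □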

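Lemma 50. Let $(B, \mathcal{B})$ be a measurable space, $I$ be a countable set, $\mathcal{B}_i$ be a sub-$\sigma$-field of $\mathcal{B}$,
$i \in I$, and $B_i \in \mathcal{B}_i$, $i \in I$, be such that $\bigcup_{i \in I} B_i = B$. Then,

$$\mathcal{K} = \{C \subset B : \forall_{i \in I}\ C \cap B_i \in \mathcal{B}_i\} \tag{5.24}$$

is a sub-$\sigma$-field of $\mathcal{B}$. If further for each $i \in I$, for some set $T_i$ and for some $C_{i,t} \in \mathcal{B}_i$, $t \in T_i$, it
holds $\mathcal{B}_i = \sigma(C_{i,t} : t \in T_i)$, then

$$\mathcal{K} = \sigma(B_i, \{C_{i,t} \cap B_i : t \in T_i\} : i \in I). \tag{5.25}$$

Proof. For each $C \in \mathcal{K}$ it holds $C \cap B_i \in \mathcal{B}_i$, $i \in I$, and thus $C = \bigcup_{i \in I} C \cap B_i \in \mathcal{B}$. It holds
$\emptyset \in \mathcal{K}$, for $A_i \in \mathcal{K}$, $i \in \mathbb{N}$, $(\bigcup_{i \in \mathbb{N}} A_i) \cap B_j = \bigcup_{i \in \mathbb{N}} (A_i \cap B_j) \in \mathcal{B}_j$, $j \in I$, and for $A \in \mathcal{K}$, $A' \cap B_i =
B_i \setminus (B_i \cap A) \in \mathcal{B}_i$, $i \in I$, so that $\mathcal{K}$ is a sub-$\sigma$-field of $\mathcal{B}$. For $A \in \mathcal{K}$ we have $A = \bigcup_{i \in I} A \cap B_i$
and $A \cap B_i \in B_i \cap \mathcal{B}_i$, $i \in I$. Furthermore, $B_i \cap \mathcal{B}_i \subset \mathcal{K}$, $i \in I$. Thus, $\mathcal{K} = \sigma(B_i \cap \mathcal{B}_i : i \in I)$.
Therefore, (5.25) follows from the fact that from Lemma 49

$$B_i \cap \mathcal{B}_i \subset \sigma(B_i, \{C_{i,t} \cap B_i : t \in T_i\}), \quad i \in I. \tag{5.26}$$

□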

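Lemma 51. Let $(D, \mathcal{D})$ be a measurable space, $(\mathcal{F}_n)_{n \in \mathbb{N}}$ be a filtration in a measurable space
$(\Omega, \mathcal{F})$, and $\tau$ be a stopping time for such a filtration. Then,

$$\mathcal{D} \otimes \mathcal{F}_\tau = \{C \subset D \times \Omega : \forall_{k \in \mathbb{N} \cup \{\infty\}}\ C \cap (D \times \{\tau = k\}) \in \mathcal{D} \otimes \mathcal{F}_k\}. \tag{5.27}$$

Proof. Let us denote the right-hand side of (5.27) as $\mathcal{K}$. Then, it is equal to such a $\mathcal{K}$ from
Lemma 50 for $B = D \times \Omega$, $\mathcal{B} = \mathcal{D} \otimes \mathcal{F}$, $I = \mathbb{N} \cup \{\infty\}$, and for $B_i = D \times \{\tau = i\}$ and $\mathcal{B}_i = \mathcal{D} \otimes \mathcal{F}_i$,
$i \in I$, which let us further consider. From that lemma,

$$\mathcal{K} = \sigma(C_1 \times (C_2 \cap \{\tau = i\}) : C_1 \in \mathcal{D},\ C_2 \in \mathcal{F}_i,\ i \in I). \tag{5.28}$$

By definition, $\mathcal{D} \otimes \mathcal{F}_\tau = \sigma(C_1 \times C_2 : C_1 \in \mathcal{D},\ C_2 \in \mathcal{F}_\tau)$. For each $C_1 \in \mathcal{D}$, $C_2 \in \mathcal{F}_\tau$ and $i \in I$ it
holds $(C_1 \times C_2) \cap (D \times \{\tau = i\}) = C_1 \times (C_2 \cap \{\tau = i\})$ so that $\mathcal{D} \otimes \mathcal{F}_\tau \subset \mathcal{K}$. For each $C_1 \in \mathcal{D}$,
$i \in I$, and $C_2 \in \mathcal{F}_i$ it holds $C_1 \times (C_2 \cap \{\tau = i\}) \in \mathcal{D} \otimes \mathcal{F}_\tau$ so that $\mathcal{K} \subset \mathcal{D} \otimes \mathcal{F}_\tau$. □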

Let us now provide a proof of Theorem 48. 

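Proof. From Condition 47 and Condition 18 holding for $L$, for $n \in \mathbb{N}$, for $\gamma_n(b)$ given by $\Lambda(b)$,
$b \in A$, in the way that $\gamma_n$ is given by $\Lambda$ in the previous section, $(b, x) \to \gamma_n(b)(x)$ is measurable
from $\mathcal{S}(A) \otimes (E, \mathcal{F}_n)$ to $\mathcal{S}(\mathbb{R})$. Let $B \in \mathcal{B}(\mathbb{R})$. For $n \in \mathbb{N}$ it holds

$$L^{-1}(B) \cap (A \times \{\tau = n\}) = \gamma_n^{-1}(B) \cap (A \times \{\tau = n\}) \in \mathcal{B}(A) \otimes \mathcal{F}_n. \tag{5.29}$$

Furthermore, $L^{-1}(B) \cap (A \times \{\tau = \infty\})$ is equal to $A \times \{\tau = \infty\}$ if $c \in B$ and to $\emptyset$ otherwise. Thus,
from Lemma 51, $L^{-1}(B) \in \mathcal{B}(A) \otimes \mathcal{F}_\tau$. □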

Condition 52. $\mathbb{Q}_1(Z \neq 0, \tau = \infty) = 0$ and $\mathbb{Q}(b)(Z \neq 0, \tau = \infty) = 0$, $b \in A$.

Condition 53. It holds $\tau < \infty$, $\mathbb{Q}_1$ a.s. and $\mathbb{Q}(b)$ a.s., $b \in A$.

Remark 54. From (5.14), Condition 21 is satisfied for $B_1 = \{\tau < \infty\}$. Thus, for such a $B_1$,
Condition 24 is equivalent to Condition 52. In particular, Condition 22 is implied by Condition
53.

Definition 55. Let B = IR^, A = IR^ and let an -valued process A = (.A]c)k>o on be such 
that for each j e { 1 ,...,/}, (((Afc);j)^^j)(teM e ^. Then, we define the corresponding linear 
parametrization A of tilting processes as in (5.22) to be such that 

$$\Lambda_k(b) = \Lambda_k b, \quad k \in \mathbb{N},\ b \in A. \tag{5.30}$$

Note that for $\Lambda$ as in the above definition Condition 47 holds and we have $\mathbb{Q}(0) = \mathbb{Q}_1$.

5.3.3 Change of measure for Gaussian stopped sequences using a tilting process 

Let $\mathbb{U}_1 = \mathcal{N}(0, I_d)$, $X = \mathrm{id}_{\mathbb{R}^d}$, and let $\mathbb{U}(b)$ and $L(b)$, $b \in B := \mathbb{R}^d$, be the exponentially tilted
distributions and densities corresponding to such $X$, $\mathbb{Q}_1 = \mathbb{U}_1$, and $A = B$, as in Section 5.1. For
such distributions and densities, let us consider the corresponding definitions for stopped
sequences for some tilting process $\Lambda$ and $\lambda_k$, $k \in \mathbb{N}$, as in Section 5.3.1. In particular,
$L(b)(x) = \exp(\frac{1}{2}|b|^2 - b^T x)$. Let $\tilde{\eta}_k = \eta_k - \Lambda_{k-1}$, $k \in \mathbb{N}_+$. The following theorem is a discrete
version of Girsanov's theorem.

Theorem 56. Under $\mathbb{V}$, the random variables $\tilde{\eta}_k$, $k \in \mathbb{N}_+$, are i.i.d. $\sim \mathcal{N}(0, I_d)$.
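Proof. Writing $\lambda_k$ in the place of $\lambda_k((x_i)_{i=1}^k)$, $k \in \mathbb{N}_+$, for each $n \in \mathbb{N}_+$ and $\Gamma \in \mathcal{B}((\mathbb{R}^d)^n)$,

$$\begin{aligned}
\mathbb{V}((\tilde{\eta}_i)_{i=1}^n \in \Gamma) &= E_\mathbb{U}(\mathbb{1}((\tilde{\eta}_i)_{i=1}^n \in \Gamma)\gamma_n^{-1}) \\
&= \int_{(\mathbb{R}^d)^n} \mathbb{1}((x_k - \lambda_{k-1})_{k=1}^n \in \Gamma) \frac{1}{(2\pi)^{nd/2}} \exp\Big(-\frac{1}{2}\sum_{k=1}^n |x_k - \lambda_{k-1}|^2\Big)\, dx_n \ldots dx_1 \\
&= \int_\Gamma \frac{1}{(2\pi)^{nd/2}} \exp\Big(-\frac{1}{2}\sum_{k=1}^n |y_k|^2\Big)\, dy_n \ldots dy_1,
\end{aligned} \tag{5.31}$$

where we used Fubini's theorem and a sequence of changes of variables $y_k(x_k) = x_k - \lambda_{k-1}$,
$k = 1, \ldots, n$, each of which is a diffeomorphism with a Jacobian 1. □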
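Theorem 56 suggests a direct way of sampling under $\mathbb{V}$: draw $\tilde{\eta}_{k+1} \sim \mathcal{N}(0, 1)$, set $\eta_{k+1} = \tilde{\eta}_{k+1} + \Lambda_k$, and accumulate $\ln L(\Lambda_k)(\eta_{k+1}) = \frac{1}{2}\Lambda_k^2 - \Lambda_k \eta_{k+1}$. The following is a minimal sketch of this for $d = 1$ and a deterministic $\tau = n$. The function names and the bounded tilt `example_tilt` are hypothetical choices made for illustration and do not come from the text.

```python
import math
import random


def sample_under_v(tilt, n_steps, rng):
    """Draw (eta_1, ..., eta_n) under V and return it with L = gamma_n.

    `tilt(etas)` plays the role of Lambda_k; it may depend only on the
    already generated eta_1, ..., eta_k (here d = 1).
    """
    etas, log_l = [], 0.0
    for _ in range(n_steps):
        lam = tilt(etas)
        eta_tilde = rng.gauss(0.0, 1.0)       # i.i.d. N(0, 1) under V (Theorem 56)
        eta = eta_tilde + lam                 # eta_{k+1} = eta~_{k+1} + Lambda_k
        log_l += 0.5 * lam * lam - lam * eta  # ln L(Lambda_k)(eta_{k+1})
        etas.append(eta)
    return etas, math.exp(log_l)


def is_estimate(z, tilt, n_steps, n_samples, seed=0):
    """IS estimate of E_U(z(eta_1, ..., eta_n)) as a V-sample mean of z * L."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        etas, l = sample_under_v(tilt, n_steps, rng)
        total += z(etas) * l
    return total / n_samples


def example_tilt(etas):
    """Hypothetical bounded, path-dependent tilt used only for this sketch."""
    return 0.3 * math.tanh(sum(etas))


# E_V(L) should be close to 1, and P_U(eta_1 + ... + eta_5 > 0) equals 0.5.
mean_l = is_estimate(lambda e: 1.0, example_tilt, 5, 20000)
prob_pos = is_estimate(lambda e: 1.0 if sum(e) > 0 else 0.0, example_tilt, 5, 20000)
```

Both estimates are Monte Carlo averages, so they agree with their targets only up to statistical error.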

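Let us consider a function $\pi : E \to E$ such that $\pi = (\tilde{\eta}_k)_{k \in \mathbb{N}_+}$. Its inverse function $\pi^{-1}$ is given by
the formula

$$\pi^{-1} = (\eta_k + \Lambda_{k-1}(\pi^{-1}))_{k \in \mathbb{N}_+}, \tag{5.32}$$

or in more detail we have $\pi^{-1} = (\bar{\eta}_i)_{i \in \mathbb{N}_+}$ for $\bar{\eta}_i = \eta_i + \bar{\lambda}_{i-1}$, $i \in \mathbb{N}_+$, where $\bar{\lambda}_0 = \lambda_0$ and $\bar{\lambda}_k =
\lambda_k((\bar{\eta}_i)_{i=1}^k)$, $k \in \mathbb{N}_+$. Note that both $\pi$ and $\pi^{-1}$ are measurable from $\mathcal{E}_n := (E, \mathcal{F}_n)$ to $\mathcal{E}_n$, $n \in \mathbb{N}$,
i.e. $\pi$ is an isomorphism of $\mathcal{E}_n$, $n \in \mathbb{N}$, and thus also of $\mathcal{E}_\infty := (E, \mathcal{F}_\infty)$. From Theorem 56
we have $\mathbb{U}(B) = \mathbb{V}(\pi^{-1}[B])$, $B \in \mathcal{F}_\infty$, so that

$$\mathbb{U}(\pi[B]) = \mathbb{V}(B), \quad B \in \mathcal{F}_\infty. \tag{5.33}$$

In particular, for each random variable $F$ on $\mathcal{E}_\infty$ the distribution of $F\pi^{-1} := F(\pi^{-1})$ under $\mathbb{U}$ is
the same as of $F$ under $\mathbb{V}$.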

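Remark 57. For $\pi[\cdot]$ denoting the image function of $\pi$, we have

$$\begin{aligned}
\pi[\mathcal{F}_\tau] &= \{\pi[B] : B \in \mathcal{F}_\tau\} \\
&= \{\pi[B] : B \in \mathcal{F}_\infty,\ B \cap \{\tau = k\} \in \mathcal{F}_k,\ k \in \mathbb{N} \cup \{\infty\}\} \\
&= \{C \in \mathcal{F}_\infty : \pi^{-1}[C] \cap \{\tau = k\} \in \mathcal{F}_k,\ k \in \mathbb{N} \cup \{\infty\}\} \\
&= \{C \in \mathcal{F}_\infty : C \cap \{\tau\pi^{-1} = k\} \in \mathcal{F}_k,\ k \in \mathbb{N} \cup \{\infty\}\} \\
&= \mathcal{F}_{\tau\pi^{-1}},
\end{aligned} \tag{5.34}$$

where in the fourth equality we used the fact that $\pi$ is an isomorphism of $(E, \mathcal{F}_n)$, $n \in \mathbb{N} \cup \{\infty\}$. In
particular, if a random variable $Y$ on $(E, \mathcal{F}_\infty)$ is $\mathcal{F}_\tau$-measurable, then $Y\pi^{-1}$ is $\mathcal{F}_{\tau\pi^{-1}}$-measurable, i.e.
it depends only on the information available until the time $\tau\pi^{-1}$.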

For some parametrization $\Lambda(b)$, $b \in A$, of tilting processes as in (5.22), let $\pi_b$ be given by $\Lambda(b)$
in the way that $\pi$ is given by $\Lambda$ above. Let further $\mathbb{Q}(b)$ and $L(b)$, $b \in A$, correspond to such
a parametrization as in Section 5.3.2, and let $\mathcal{S}_1 = (\Omega_1, \mathcal{F}_1)$ and $\mathbb{Q}_1$ be as in that section. Let
$\xi : E \times A \to E$ be such that

$$\xi(\eta, b) = \pi_b^{-1}(\eta), \quad b \in A. \tag{5.35}$$
Theorem 58. Under Condition 47 and the above definitions, Condition 19 holds for 
andV’i = U. 


Proof. From (5.32) it follows by induction that ^ is measurable from i8> ^(A) to n e 
N. Thus, it is also measurable from ^ 5^[A) to lo and due to c g, also to 

Furthermore, from (5.33), 

V{a;b)-hB]) = y\rib){B), Be^„ (5.36) 

i.e. (3.19) holds. □


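For each random variable $F$ on $(E, \mathcal{F}_\infty)$ and $b \in A$, let us denote

$$F^{(b)} = F\pi_b^{-1} = F(\xi(\cdot, b)). \tag{5.37}$$

Note that from (5.32) we have

$$(\xi(\eta, b))_k = \eta_k^{(b)} = \eta_k + \Lambda_{k-1}(b)^{(b)}, \quad k \in \mathbb{N}_+. \tag{5.38}$$

For each $b \in A$ it holds

$$L(b) = \mathbb{1}(\tau < \infty)\exp\Big(\sum_{k=0}^{\tau-1}\Big(\frac{1}{2}|\Lambda_k(b)|^2 - \Lambda_k(b)^T\eta_{k+1}\Big)\Big) + \mathbb{1}(\tau = \infty)c. \tag{5.39}$$

From (5.38) and (5.39), for each $b', b \in A$ we have

$$L(b)^{(b')} = \mathbb{1}(\tau^{(b')} < \infty)\exp\Big(\sum_{k=0}^{\tau^{(b')}-1}\Big(\frac{1}{2}|\Lambda_k(b)^{(b')}|^2 - (\Lambda_k(b)^{(b')})^T(\eta_{k+1} + \Lambda_k(b')^{(b')})\Big)\Big) + \mathbb{1}(\tau^{(b')} = \infty)c \tag{5.40}$$

and in particular

$$L(b)^{(b)} = \mathbb{1}(\tau^{(b)} < \infty)\exp\Big(-\sum_{k=0}^{\tau^{(b)}-1}\Big(\frac{1}{2}|\Lambda_k(b)^{(b)}|^2 + (\Lambda_k(b)^{(b)})^T\eta_{k+1}\Big)\Big) + \mathbb{1}(\tau^{(b)} = \infty)c. \tag{5.41}$$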


5.3.4 Linearly parametrized exponential tilting for stopped sequences 

Let Ui, = iE,h, X, Vib), Lib), 'F(b), beB = U'^, and F be as some Qi, = (Oi, J^i), 
X, Qib), Lib), Wib), b e A = B, and F in the ECM setting in Section 5.1. Let A be a linear 
parametrization of tilting processes corresponding to some $\Lambda$ as in Definition 55 and consider
the corresponding families of probabilities $\mathbb{Q}(b)$ and densities $L(b)$, $b \in A$, as in Section 5.3.2.
Note that we now have from (5.17), for $U(b) = \mathbb{1}(\tau < \infty)\sum_{k=0}^{\tau-1}\Psi(\Lambda_k b)$, $b \in A$, and
$H = -\mathbb{1}(\tau < \infty)\sum_{k=0}^{\tau-1}X(\eta_{k+1})^T\Lambda_k$, that

$$L(b) = \mathbb{1}(\tau < \infty)\exp(U(b) + Hb) + c\,\mathbb{1}(\tau = \infty), \quad b \in A. \tag{5.42}$$

We shall call the above parametrization of IS the linearly parametrized exponentially tilted
stopped sequences (LETS) setting. Its special case in which $\mathbb{U}_1 = \mathcal{N}(0, I_d)$ and $X = \mathrm{id}_{\mathbb{R}^d}$ shall
be called the linearly parametrized exponentially tilted Gaussian stopped sequences (LETGS)
setting. Note that the LETGS setting is a special case of the parametrized IS for Gaussian
stopped sequences as in Section 5.3.3. In the LETGS setting $H = -\mathbb{1}(\tau < \infty)\sum_{k=0}^{\tau-1}\eta_{k+1}^T\Lambda_k$ and
for $G := \mathbb{1}(\tau < \infty)\frac{1}{2}\sum_{k=0}^{\tau-1}\Lambda_k^T\Lambda_k$ we have $U(b) = b^T G b$, $b \in A$, so that

$$L(b) = \mathbb{1}(\tau < \infty)\exp(b^T G b + Hb) + \mathbb{1}(\tau = \infty)c, \quad b \in A. \tag{5.43}$$

Furthermore, we have

$$H^{(b')} = -\mathbb{1}(\tau^{(b')} < \infty)\sum_{k=0}^{\tau^{(b')}-1}(\eta_{k+1} + \Lambda_k^{(b')}b')^T\Lambda_k^{(b')}, \tag{5.44}$$

and formula (5.40) can be rewritten as

$$L(b)^{(b')} = \mathbb{1}(\tau^{(b')} < \infty)\exp(b^T G^{(b')} b + H^{(b')}b) + \mathbb{1}(\tau^{(b')} = \infty)c. \tag{5.45}$$
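In the LETGS setting the whole dependence of $L(b)$ on $b$ is carried by the pair $(G, H)$, which can be accumulated once per simulated path and then reused for every $b$. The sketch below computes $G$ and $H$ for a path with a finite $\tau$ and compares the quadratic form in (5.43) with the step-by-step sum in (5.39) for $\Lambda_k(b) = \Lambda_k b$. The function names and the sample numbers are illustrative assumptions only.

```python
def letgs_g_h(lams, etas):
    """Return G (l x l) and H (length l) for a path with tau = len(lams).

    lams[k] is Lambda_k as a d x l matrix (list of rows) and etas[k] is
    eta_{k+1} as a list of length d.
    """
    l = len(lams[0][0])
    g = [[0.0] * l for _ in range(l)]
    h = [0.0] * l
    for lam, eta in zip(lams, etas):
        for row, e in zip(lam, eta):
            for i in range(l):
                h[i] -= e * row[i]                    # H = -sum_k eta_{k+1}^T Lambda_k
                for j in range(l):
                    g[i][j] += 0.5 * row[i] * row[j]  # G = 1/2 sum_k Lambda_k^T Lambda_k
    return g, h


def log_l_quadratic(g, h, b):
    """ln L(b) = b^T G b + H b on {tau < infinity}, as in (5.43)."""
    n = len(b)
    quad = sum(b[i] * g[i][j] * b[j] for i in range(n) for j in range(n))
    return quad + sum(hi * bi for hi, bi in zip(h, b))


def log_l_direct(lams, etas, b):
    """ln L(b) summed step by step, as in (5.39) with Lambda_k(b) = Lambda_k b."""
    total = 0.0
    for lam, eta in zip(lams, etas):
        tilt = [sum(r * bi for r, bi in zip(row, b)) for row in lam]  # Lambda_k b
        total += 0.5 * sum(t * t for t in tilt) - sum(t * e for t, e in zip(tilt, eta))
    return total


# Illustrative path with d = 2, l = 2 and tau = 3.
lams = [[[1.0, 0.5], [0.0, 2.0]], [[0.3, -1.0], [1.5, 0.2]], [[-0.7, 0.4], [0.1, 0.9]]]
etas = [[0.2, -1.1], [1.3, 0.4], [-0.5, 0.8]]
b = [0.4, -0.25]
g, h = letgs_g_h(lams, etas)
```

Storing only $(G, H)$ per path is what makes repeated evaluation of $L(b)$ for many values of $b$ cheap in this setting.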

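Remark 59. Note that in the LETGS setting, on $\{\tau < \infty\}$ we have

$$\inf_{b \in \mathbb{R}^l} \ln(L(b)) \ge \sum_{k=1}^{\tau} \inf_{y \in \mathbb{R}^d}\Big(\frac{1}{2}|y|^2 - \eta_k^T y\Big) = -\frac{1}{2}\sum_{k=1}^{\tau}|\eta_k|^2 \in \mathbb{R}. \tag{5.46}$$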

Remark 60. In our numerical experiments performing IS for computing expectations of functionals
of an Euler scheme in the LETGS setting, the simulation times were roughly proportional
to the replicates of $\tau$ under $\mathbb{Q}(b)$. Thus, on several occasions in this work when dealing with the
LETGS setting we shall consider the theoretical cost $C = s\tau$ for some $s \in \mathbb{R}_+$.

Remark 61. Consider the special case of the LETS setting in which $\Lambda$ is a sequence of constant 
matrices and t = n e N+ is deterministic. Then, for the above Q{b) and L[b), be A, a family 
of probabilities Q'{b),b e A, on such thatQ'ib){T]„[G]) = Q(h)(G), G e b e A, and 
L’: Ax E" -^U such thatL'ib) (ry„) = L{b), are the exponentially tilted families of probabilities 
and densities corresponding to := U” andX'{(if) := L”=i Aj_.^X{(i)i), w = {(Oi)"_.^ e E”, as in 
Section 5.1. Note that for each random variable Y' on 'if”, Y = Y'{qn) is an t^n->^^tisurable 
random variable with the same distribution under <Q[b) as ofY' under Qib)', be A. Note also 
that if further T = 1 and Aq = Id, then Qj := Ui, L'{b) = L{b) Q'{b) = U(h), be A, and X' = X. 

5.4 IS for a Brownian motion up to a stopping time 

Let us now briefly discuss IS for computing expectations of functionals of a Brownian motion
up to a stopping time. For some $d \in \mathbb{N}_+$, let $B = (B_t)_{t \ge 0}$ be the coordinate process on the
Wiener space $\mathcal{C}([0, \infty), \mathbb{R}^d)$, whose measurable space let us denote as $\mathcal{W}$. Let $(\mathcal{F}_t)_{t \ge 0}$ be the
natural filtration of $B$. Let $\mathbb{U}$ be the unique probability on $\mathcal{W}$ for which $B$ is a $d$-dimensional
Brownian motion (see Chapter 1, Section 3 in [44]). For a probability $\mathbb{S}$ on $\mathcal{W}$ and a stopping
time $\tau$ for $(\mathcal{F}_t)_{t \ge 0}$, we denote $\mathbb{S}|_\tau = \mathbb{S}|_{\mathcal{F}_\tau}$. From Girsanov's theorem, if $(\Lambda_t)_{t \ge 0}$ is a predictable
locally square-integrable $\mathbb{R}^d$-valued process on $\mathcal{W}$ for which


Jt = exp 


f lAsl^ds], t> 0 , 

Jo ^ Jo 


(5.47) 


is a martingale under U (for which e.g. Novikov’s condition suffices), then from Kolmogorov’s 
extension theorem there exists a unique measure V on ^ such that = 0. Further- 

dV\t: 

more, 


$$\tilde{B}_t = B_t - \int_0^t \Lambda_s\, ds, \quad t \ge 0, \tag{5.48}$$


is a Brownian motion under V. From Proposition 1.3, Chapter 8 in [44], for a stopping time t 
for f >0 > we have 11 (f < oo) ff = [ ] and thus L = 1 (f < oo) i ] , similarly as 

in the discrete case. Thus, if for some $\mathbb{R}$-valued $Z \in L^1(\mathbb{U}|_\tau)$ we have $\mathbb{U}$ and $\mathbb{V}$ a.s. that $\tau = \infty$
implies $Z = 0$, then $\mathbb{U}|_\tau \sim_{Z \neq 0} \mathbb{V}|_\tau$ and we can perform IS for computing $E_\mathbb{U}(Z)$ analogously as
in the discrete case. For adaptive IS, for some $l \in \mathbb{N}_+$, we can use e.g. a linear parametrization
$\Lambda_t(b) = \Lambda_t b$, $b \in A := \mathbb{R}^l$, of tilting processes for some $\mathbb{R}^{d \times l}$-valued predictable process $(\Lambda_t)_{t \ge 0}$
with locally square integrable coordinates.

Due to the fact that the sequence $(B_{k+1} - B_k)_{k \in \mathbb{N}}$ has i.i.d. $\sim \mathcal{N}(0, I_d)$ coordinates under $\mathbb{U}$,
under appropriate identifications the LETGS setting can be viewed as a special discrete case of
the IS for Brownian motion with a linear parametrization of tilting processes as above. In the
further sections we focus mainly on the discrete case, both for simplicity and due to it having
important numerical applications. However, many of our reasonings can be generalized to
the Brownian case.


5.5 IS for diffusions and Euler schemes 

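Let us use the notations for IS for a Brownian motion from the previous section. Let us
consider Lipschitz functions $\mu : \mathcal{S}(\mathbb{R}^m) \to \mathcal{S}(\mathbb{R}^m)$ and $\sigma : \mathcal{S}(\mathbb{R}^m) \to \mathcal{S}(\mathbb{R}^{m \times d})$. Then, there
exists a unique strong solution $Y$ of the SDE

$$dY_t = \mu(Y_t)\,dt + \sigma(Y_t)\,dB_t, \quad Y_0 = x_0 \tag{5.49}$$

(see e.g. Section 5.2 in [31]). Such a $Y$ is called a diffusion, $\mu$ a drift, and $\sigma$ a diffusion matrix.
For $\tilde{\tau}$ being a stopping time for $(\mathcal{F}_t)_{t \ge 0}$ (like e.g. some hitting time of $Y$ of an appropriate set)
and some $\mathbb{R}$-valued $\tilde{Z} \in L^1(\mathbb{U}|_{\tilde{\tau}})$, one can be interested in estimating

$$\tilde{\phi}(x_0) = E_\mathbb{U}(\tilde{Z}). \tag{5.50}$$

A popular way of discretizing $Y$, especially in many dimensions, is by using an Euler scheme
$X = (X_k)_{k \in \mathbb{N}}$ with a time step $h \in \mathbb{R}_+$, which, for some $\eta_1, \eta_2, \ldots$ i.i.d. $\sim \mathcal{N}(0, I_d)$ and some
starting point $x_0 \in \mathbb{R}^m$, fulfills $X_0 = x_0$ and

$$X_{k+1} = X_k + h\mu(X_k) + \sqrt{h}\,\sigma(X_k)\eta_{k+1}, \quad k \in \mathbb{N}. \tag{5.51}$$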
We shall sometimes need a time-extended version $X'$ of such an $X$, defined in the remark
below.

Remark 62. For an Euler scheme $X$ as above, $X' = (X_k, kh)_{k \in \mathbb{N}}$ is also an Euler scheme, in
the definition of which, in the place of $m$, $x_0$, $\mu$ and $\sigma$, we use $m' = m + 1$, $x_0' = (x_0, 0)$, as
well as $\mu' : \mathbb{R}^{m'} \to \mathbb{R}^{m'}$ and $\sigma' : \mathbb{R}^{m'} \to \mathbb{R}^{m' \times d}$ such that for each $x \in \mathbb{R}^m$ and $t \in \mathbb{R}$ we have
$\mu'(x, t) = (\mu(x), 1)$, $\sigma'_{i,j}(x, t) = \sigma_{i,j}(x)$, $i \le m$, and $\sigma'_{m',j}(x, t) = 0$, $j \in \{1, \ldots, d\}$.

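Let further $\eta_i$, $i \in \mathbb{N}_+$, be as in Section 5.3.1 for $\mathbb{U}_1 = \mathcal{N}(0, I_d)$, so that $X$ as above is an Euler
scheme under $\mathbb{U}$ as in that section. As discussed further on, in some cases, for a sufficiently
small $h$, for an appropriate stopping time $\tau$ for $(\mathcal{F}_n)_{n \ge 0}$ and an appropriate $Z \in L^1(\mathbb{U}|_\tau)$, $\tilde{\phi}(x_0)$
can be approximated well using

$$\phi(x_0) = E_\mathbb{U}(Z). \tag{5.52}$$

For some function $r : \mathcal{S}(\mathbb{R}^m) \to \mathcal{S}(\mathbb{R}^d)$, called an IS drift, let us consider a tilting process
$\Lambda_k = \sqrt{h}\,r(X_k)$, $k \in \mathbb{N}$. Then, for

$$\tilde{\mu} = \mu + \sigma r \tag{5.53}$$

and $\tilde{\eta}_k$, $k \in \mathbb{N}_+$, as in Section 5.3.3, we have

$$X_{k+1} = X_k + h\tilde{\mu}(X_k) + \sqrt{h}\,\sigma(X_k)\tilde{\eta}_{k+1}, \quad k \in \mathbb{N}, \tag{5.54}$$

so that from Theorem 56, $X$ is an Euler scheme under $\mathbb{V}$ with a drift $\tilde{\mu}$. As discussed in Section
5.3.3, the distribution of $X$ under $\mathbb{V}$ is the same as of $\tilde{X} := X(\pi^{-1})$ under $\mathbb{U}$. Since $\tilde{\eta}_i = \eta_i\pi$, we
have $\tilde{\eta}_i\pi^{-1} = \eta_i$, $i \in \mathbb{N}_+$, so that $\tilde{X}$ satisfies $\tilde{X}_0 = x_0$ and

$$\tilde{X}_{k+1} = \tilde{X}_k + h\tilde{\mu}(\tilde{X}_k) + \sqrt{h}\,\sigma(\tilde{X}_k)\eta_{k+1}, \quad k \in \mathbb{N}, \tag{5.55}$$

i.e. it is also an Euler scheme with a drift $\tilde{\mu}$, but this time under $\mathbb{U}$.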
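The construction above can be sketched in code for $m = d = 1$: the scheme is simulated under $\mathbb{V}$ by drawing $\tilde{\eta}_{k+1}$, while $\eta_{k+1} = \tilde{\eta}_{k+1} + \Lambda_k$ with $\Lambda_k = \sqrt{h}\,r(X_k)$ enters the likelihood ratio $L$. The example estimates the probability that a driftless scheme with unit diffusion started at $0$ leaves $(-1, 1)$ to the right, which equals $0.5$ by symmetry. All function names, the domain, the constant IS drift $0.5$, and the sample sizes are assumptions chosen for illustration.

```python
import math
import random


def euler_is_sample(mu, sigma, r, x0, h, in_domain, rng, max_steps=100000):
    """Simulate X under V until it leaves the domain; return (X_tau, tau, L)."""
    x, log_l = x0, 0.0
    for k in range(max_steps):
        if not in_domain(x):
            return x, k, math.exp(log_l)
        lam = math.sqrt(h) * r(x)             # Lambda_k = sqrt(h) r(X_k)
        eta = rng.gauss(0.0, 1.0) + lam       # eta_{k+1} = eta~_{k+1} + Lambda_k
        log_l += 0.5 * lam * lam - lam * eta  # ln L(Lambda_k)(eta_{k+1})
        x = x + h * mu(x) + math.sqrt(h) * sigma(x) * eta  # step (5.51)
    raise RuntimeError("exit time exceeded max_steps")


def exit_right_estimate(r, n_samples, seed=1):
    """IS estimate of the probability under U of leaving (-1, 1) to the right."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x, _, l = euler_is_sample(lambda y: 0.0, lambda y: 1.0, r, 0.0, 0.01,
                                  lambda y: -1.0 < y < 1.0, rng)
        total += l if x >= 1.0 else 0.0
    return total / n_samples


q_plain = exit_right_estimate(lambda y: 0.0, 2000)   # crude Monte Carlo, L = 1
q_tilted = exit_right_estimate(lambda y: 0.5, 2000)  # IS with a constant IS drift
```

Both estimators target the same quantity; only their variances differ.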

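For a nonempty set $A \in \mathcal{B}(\mathbb{R}^l)$, let us consider a parametrization $r : A \to \{f : \mathbb{R}^m \to \mathbb{R}^d\}$ of
IS drifts, such that $(b, x) \to r(b)(x)$ is measurable from $\mathcal{S}(A) \otimes \mathcal{S}(\mathbb{R}^m)$ to $\mathcal{S}(\mathbb{R}^d)$, and let
$\mu(b) = \mu + \sigma r(b)$, $b \in A$. Consider a parametrization $\Lambda$ of tilting processes as in (5.22) such that

$$\Lambda(b) = (\Lambda_k(b))_{k \in \mathbb{N}} = (\sqrt{h}\,r(b)(X_k))_{k \in \mathbb{N}}, \quad b \in A. \tag{5.56}$$

Note that Condition 47 holds for such a parametrization. Note also that, using notation (5.37),
from (5.55) we have

$$X_{k+1}^{(b)} = X_k^{(b)} + h\mu(b)(X_k^{(b)}) + \sqrt{h}\,\sigma(X_k^{(b)})\eta_{k+1}, \quad k \in \mathbb{N}. \tag{5.57}$$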

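Let us now describe the linear case of the above parametrization, leading to IS in the special
case of the LETGS setting. We take $A = \mathbb{R}^l$ and for some functions $\tilde{r}_i : \mathcal{S}(\mathbb{R}^m) \to \mathcal{S}(\mathbb{R}^d)$,
$i = 1, \ldots, l$, called IS basis functions, we set

$$r(b)(x) = \sum_{i=1}^{l} b_i \tilde{r}_i(x), \quad b \in \mathbb{R}^l,\ x \in \mathbb{R}^m. \tag{5.58}$$

Let $\Theta : \mathbb{R}^m \to \mathbb{R}^{d \times l}$ be such that for $i = 1, \ldots, l$ and $j = 1, \ldots, d$

$$\Theta_{j,i}(x) = \sqrt{h}\,(\tilde{r}_i)_j(x). \tag{5.59}$$

Then, a process $\Lambda$ leading to $\Lambda(b)$ given by (5.30) and such that (5.56) holds, can be defined as

$$\Lambda_k = \Theta(X_k), \quad k \in \mathbb{N}. \tag{5.60}$$

An example of a stopping time $\tau$ for $(\mathcal{F}_k)_{k \ge 0}$ is an exit time of $X$ of some $D \in \mathcal{B}(\mathbb{R}^m)$, that is
$\tau = \inf\{k \in \mathbb{N} : X_k \notin D\}$, for which we have $\tau^{(b)} = \inf\{k \in \mathbb{N} : X_k^{(b)} \notin D\}$, $b \in A$.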

Theorem 63. Let us consider some linear parametrization of IS drifts as above. Let $\tau$ be the exit
time of $X$ of $D \in \mathcal{B}(\mathbb{R}^m)$ such that $x_0 \in D$, let $B \subset \mathbb{R}^l$ be nonempty, and let there exist $v \in \mathbb{R}^m$,
$v \neq 0$, such that

$$M_1 := \sup_{x, y \in D} |v^T(x - y)| < \infty \tag{5.61}$$

and

$$M_2 := \sup_{x \in D,\, b \in B} |v^T\mu(b)(x)| < \infty. \tag{5.62}$$

For some i e {\,...,d}, let there existSi e IR+ anddj e [0,oo), j e {l,...,d}, j i, such that 
|(r'^o'(x));| > 5i and\{v^a{x)]j\ < 5j, j 7 ^ Z, x e D. LetM = Mi + hMz and consider the following 
random conditions on ^ for keN+ 

tlki(^) = i\'nk,i(.(^)Si\> ^ + \ Y. rik.ji(^)Sj\), weE. (5.63) 

Then, a random variable f onsuch that fio)) = infjfc e N+ : < 7 fc(fc»)}, we E, fulfills < f, 
beB. UnderV, the variable f has a geometric distribution with a parameter q = Viq\), thatis 
V[f=k) = qa-q)’^~KkeN+. 

Proof Let beB. From (5.57), for each fc e N+ 

(Xf €D)a (Xf_\ eD) = (\/h( 7 (xf_\) 77 fc CD- xf_\ - hp{b] (xf_\)) A (xf_\ e D) 

^ {Vhv^a{Xl^\)Pk C v^iD - X'^\ - Zip(h)(xf_\))) A (xf_\ e D) 
^ iVh\ i;^(7(xf_\)r?fc| > M) A (X^^ e D) 

^(7fcA(xf_\eD). 

(5.64) 
Thus, qt ^ t Dv X® t D) ^ < k. For w e £ such that f (&») < oo it holds qt^o)) 

and thus (to) < f (to). □ 


Remark 64. Note that if for each $b \in A$ the assumptions of Theorem 63 hold for $B = \{b\}$, then
from $\tau^{(b)}$ having the same distribution under $\mathbb{U}$ as $\tau$ under $\mathbb{V}(b)$, we receive that $\tau$ has all finite
moments under $\mathbb{V}(b)$, $b \in A$, and in particular Condition 53 holds.

We say that a matrix- or vector-valued function $f$ is uniformly bounded on some subset $B$ of
its domain if for some arbitrary vector or matrix norm $\|\cdot\|$ we have $\sup_{x \in B}\|f(x)\| < \infty$.

Remark 65. Note that (5.61) holds for $D$ bounded and arbitrary $v \in \mathbb{R}^m$. Furthermore, if for
some $v \in \mathbb{R}^m$, $v^T\mu$, $v^T\sigma$, and $\Theta$ are uniformly bounded on $D$ (which holds e.g. when they are
continuous on $\mathbb{R}^m$ and $D$ is bounded) then (5.62) holds for each bounded $B$.

5.6 Zero-variance IS for diffusions 

To provide an intuition when the variance of the IS estimator of the expectation of a functional of
an Euler scheme can be small, let us briefly describe a situation when its diffusion counterpart
has zero variance. See Section 4 in [21] for details. Using notations as in the previous section,
for $\tilde{\tau}$ being the hitting time of $Y$ of a boundary of an open set $D$ such that $x_0 \in D$, as well as for an
appropriate $g : \mathbb{R}^m \to \mathbb{R}$ and $\beta : \mathbb{R}^m \to \mathbb{R}$, consider

$$\tilde{Z} = \mathbb{1}(\tilde{\tau} < \infty)\,g(Y_{\tilde{\tau}})\exp\Big(\int_0^{\tilde{\tau}} \beta(Y_s)\,ds\Big). \tag{5.65}$$

If there exists an appropriate function $u : \mathbb{R}^m \to \mathbb{R}$, such that for $\mathcal{L}u = \frac{1}{2}\mathrm{Tr}(\sigma\sigma^T\nabla^2 u) + \mu^T\nabla u + \beta u$
we have

$$\mathcal{L}u(x) = 0, \quad x \in D, \tag{5.66}$$

and

$$u(x) = g(x), \quad x \in \partial D, \tag{5.67}$$

then, from the Feynman-Kac theorem, $\tilde{\phi}(x_0) = u(x_0)$. Under certain assumptions, including
$u(x) > 0$, $x \in D$, it can be proved (see Theorem 4 in [21]) that for $r$ equal to

$$r^* := \frac{\sigma^T\nabla u}{u} = \sigma^T\nabla(\ln(u)), \tag{5.68}$$

for the IS for a Brownian motion as in Section 5.4 with $\Lambda_t = r^*(Y_t)$, we have $\tilde{Z}L = \tilde{\phi}(x_0)$, $\mathbb{V}$ a.s.,
i.e. the IS estimator for the diffusion case has zero variance. Furthermore, from (5.48)

$$dY_t = (\mu + \sigma r^*)(Y_t)\,dt + \sigma(Y_t)\,d\tilde{B}_t, \quad Y_0 = x_0. \tag{5.69}$$
For $\tau$ being the exit time of $X$ of some set $B$, a possible Euler scheme counterpart of (5.65) is

$$Z = \mathbb{1}(\tau < \infty)\,g(X_\tau)\exp\Big(\sum_{k=0}^{\tau-1} h\beta(X_k)\Big). \tag{5.70}$$

Under appropriate assumptions for such a $Z$ we have

$$\lim_{h \to 0}\phi(x_0) = \tilde{\phi}(x_0) \tag{5.71}$$

for $B = D$; see [24, 25]. Furthermore, in [25] it was proved that in some situations the rate of
convergence in (5.71) can be increased by taking as $B$ an appropriately shifted $D$. Further on
for $\tau$ as above and $Z$ as in (5.70) we shall assume that $B = D$, but one can easily modify the
reasonings below to consider the shifted set instead. It seems intuitive that for some such $Z$,
for $r$ close to $r^*$, and for small $h$, we can receive low variance of the Euler scheme IS estimator
$ZL$. This intuition shall be confirmed in our numerical experiments in Chapter 10.


5.7 Some examples of expectations of functionals of diffusions and 
Euler schemes 

We shall now discuss several examples of expectations of functionals of diffusions and their
Euler scheme counterparts. As discussed in Chapter 1, these expectations can be of interest
among others in molecular dynamics, and their Euler scheme counterparts were estimated in
our numerical experiments described in Section 10. In the first two examples, for diffusions
we consider the expectations $\tilde{\phi}(x_0) = E_\mathbb{U}(\tilde{Z})$ for some $\tilde{Z}$ as in (5.65), and for the corresponding
Euler schemes we consider $\phi(x_0) = E_\mathbb{U}(Z)$ for the variable $Z$ as in (5.70). In the first example,
for some $p \in \mathbb{R}_+$ we take $\beta(x) = -p$ and $g(x) = 1$, $x \in \mathbb{R}^m$, so that $\tilde{Z} = \exp(-p\tilde{\tau})\mathbb{1}(\tilde{\tau} < \infty)$ and
$Z = \exp(-ph\tau)\mathbb{1}(\tau < \infty)$. The quantities $\widetilde{\mathrm{mgf}}(x_0) := \tilde{\phi}(x_0)$ and $\mathrm{mgf}(x_0) := \phi(x_0)$ for this case are
called the moment-generating functions (MGFs) of $\tilde{\tau}$ and $h\tau$ respectively. Let us consider
some $a \in \mathbb{R}$, called an added constant. For the second example let us assume that

$$\mathbb{U}(\tau < \infty) = \mathbb{U}(\tilde{\tau} < \infty) = 1 \tag{5.72}$$

and let $D' = \mathbb{R}^m \setminus D = A \cup B$ for two closed disjoint sets $A$ and $B$ from $\mathcal{B}(\mathbb{R}^m)$. Let $\beta(x) = 0$,
$x \in \mathbb{R}^m$, $g(x) = a + 1$, $x \in B$, and $g(x) = a$, $x \in A$. We receive $\tilde{Z} = \mathbb{1}(\tilde{\tau} < \infty)(a + \mathbb{1}(Y_{\tilde{\tau}} \in B))$ and
$\tilde{\phi}(x_0)$ equal to $\tilde{q}_{AB,a}(x_0) := a + \mathbb{U}(Y_{\tilde{\tau}} \in B)$, which we shall call a translated committor. For the
added constant $a = 0$, we denote $\tilde{q}_{AB,a}(x_0)$ simply as $\tilde{q}_{AB}(x_0)$ and call it a committor. In the
Euler scheme case we consider analogous definitions but with omitted tildes and with $X$ in
the place of $Y$. Committors are of interest for instance when computing the reaction rates and
characterizing the reaction mechanisms of dynamic processes; see [26, 42, 3].

For the third example, for some $D$, $X$, $\tau$, and $\tilde{\tau}$ as in Section 5.6, as well as for some $T \in \mathbb{R}_+$, let
us now consider $\tilde{Z} = \mathbb{1}(\tilde{\tau} \le T) + a$, $\tilde{p}_{T,a}(x_0) = E_\mathbb{U}(\tilde{Z})$, and $\tilde{p}_T(x_0) = \tilde{p}_{T,0}(x_0)$, while for the Euler
scheme case $Z = \mathbb{1}(h\tau \le T) + a$, $p_{T,a}(x_0) = E_\mathbb{U}(Z)$, and $p_T(x_0) = p_{T,0}(x_0)$. Note that for

$$\tau' = \tau \wedge \left\lfloor \frac{T}{h} \right\rfloor \tag{5.73}$$

it holds $Z = \mathbb{1}(X_{\tau'} \in D') + a$. Note also that for the time-extended process $X'$ corresponding to
the above $X$ as in Remark 62, such a $\tau'$ is the exit time of $X'$ of

$$\hat{D} = D \times \left[0, h\left\lfloor \frac{T}{h} \right\rfloor\right). \tag{5.74}$$

Such a $\tau'$ is the stopping time which we shall further consider by default for IS in the LETGS
setting for computing $p_{T,a}(x_0)$. A possible alternative would be to use $\tau$, which, as discussed
in Remark 46, would lead to not lower variance and mean cost for the cost variables equal to
the respective stopping times.


Remark 66. Sufficient assumptions for (5.71) to hold for the MGFs and translated committors
as above can be derived e.g. from the discussion in Section 4 in [25] (along with appropriate
convergence rates in it), while for

$$\lim_{h \to 0} p_{T,a}(x_0) = \tilde{p}_{T,a}(x_0) \tag{5.75}$$

they can be derived from reasonings analogous to those in Section 1.2 of [24].

Let $\hat{\psi}_a$ be an unbiased estimator of $\psi_a$ equal to $q_{AB,a}(x_0)$ or $p_{T,a}(x_0)$, i.e. $E(\hat{\psi}_a) = \psi_a$. Then, the
translated estimator $\hat{\psi}_{a,0} = \hat{\psi}_a - a$ is an unbiased estimator of $\psi_0$ equal to $q_{AB}(x_0)$ or $p_T(x_0)$
respectively, and $\mathrm{Var}(\hat{\psi}_{a,0}) = \mathrm{Var}(\hat{\psi}_a)$. The reason why we are considering such translated
estimators of $\psi_0$ for nonzero added constants $a$ is that using these estimators in the adaptive
IS procedures in our numerical experiments as discussed in Chapter 10 led to lower variances
and inefficiency constants than for $a = 0$.

Note that we have $q_{AB}(x_0) + q_{BA}(x_0) = 1$ and similarly for the diffusion case, so that if $\hat{q}$ is an
unbiased estimator of one of the quantities $q_{AB}(x_0)$ or $q_{BA}(x_0)$, then $1 - \hat{q}$ is such an estimator
of the other quantity with the same variance and inefficiency constant. Therefore, given an
estimator $\hat{q}_{AB}$ of $q_{AB}(x_0)$ and $\hat{q}_{BA}$ of $q_{BA}(x_0)$, it seems reasonable to compute both quantities
as above using the estimator leading to a lower inefficiency constant.


5.8 Diffusion in a potential 

We define a diffusion $Y$ in a differentiable potential $V : \mathbb{R}^m \to \mathbb{R}$ and corresponding to a
temperature $\epsilon \in \mathbb{R}_+$ to be a unique strong solution of

$$dY_t = -\nabla V(Y_t)\,dt + \sqrt{2\epsilon}\,dB_t, \quad Y_0 = x_0, \tag{5.76}$$

assuming that such a solution exists, which is the case e.g. if $\nabla V$ is Lipschitz. For such a
diffusion, under appropriate assumptions as in Section 5.6, an IS drift (5.68) leading to a
zero-variance IS estimator and probability $\mathbb{V}$ is

$$r^* = \sqrt{2\epsilon}\,\nabla(\ln(u)). \tag{5.77}$$

Let $F = -\epsilon\ln(u)$, $C_0 \in \mathbb{R}$, and let us define an optimally-tilted potential

$$V^* = V + 2F + C_0. \tag{5.78}$$

Then, (5.69) can be rewritten as

$$dY_t = -\nabla V^*(Y_t)\,dt + \sqrt{2\epsilon}\,d\tilde{B}_t, \quad Y_0 = x_0. \tag{5.79}$$

Thus, under $\mathbb{V}$, $Y$ is a diffusion in potential $V^*$.
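The step from (5.69) to (5.79) can be verified in one line. The following is a sketch assuming that the diffusion matrix in (5.76) is $\sigma = \sqrt{2\epsilon}$ times the identity matrix and that $\mu = -\nabla V$, with $r^*$, $F$, and $V^*$ as in (5.77) and (5.78):

```latex
\begin{aligned}
\mu + \sigma r^*
  &= -\nabla V + \sqrt{2\epsilon}\,\sqrt{2\epsilon}\,\nabla(\ln(u)) \\
  &= -\nabla V + 2\epsilon\,\nabla(\ln(u)) \\
  &= -\nabla V - 2\nabla F \\
  &= -\nabla (V + 2F + C_0) = -\nabla V^*.
\end{aligned}
```

The constant $C_0$ does not affect the drift, which is why it can be chosen freely, e.g. to match the tilted and the original potential in a given point.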


5.9 The special cases considered in our numerical experiments 

Let $D := (a_1, a_2) = (-3.5, 3.5)$. Consider a smooth potential $V : \mathbb{R} \to \mathbb{R}$ such that

$$V(x) = \frac{1}{200}(0.5x^6 - 15x^4 + 119x^2 + 28x + 50), \quad x \in D, \tag{5.80}$$

and $\nabla V$ is Lipschitz. Such a $V$ restricted to $D$ is shown in Figure 5.1.

Figure 5.1: The three-well potential given by (5.80) on $D$.

For a temperature $\epsilon = 0.5$, consider a diffusion $Y$ in such a potential starting at some $x_0 \in D$. Let $\tilde{\tau}$ be the hitting time
of $Y$ of the boundary of $D$. Let $A = (-\infty, a_1]$, $B = [a_2, \infty)$, and let $\tilde{q}_{1,a} = \tilde{q}_{AB,a}$ and $\tilde{q}_{2,a} = \tilde{q}_{BA,a}$
(see Section 5.7), which for $a = 0$ will be denoted simply as $\tilde{q}_1$ and $\tilde{q}_2$, and analogously in
the Euler scheme case in which the tildes are omitted. Let us also consider $\widetilde{\mathrm{mgf}}$ and $\mathrm{mgf}$ for
$p = \hat{p} := 0.1$. We computed approximations of such $\tilde{q}_i(x)$ and $\widetilde{\mathrm{mgf}}(x)$ as functions of $x$ using
finite difference discretizations of PDEs given by (5.66) and (5.67). The results are shown in
figures 5.2a and 5.2b. In figures 5.3a and 5.3b we show approximations of the optimally tilted
potentials (5.78) for the MGF and committors $\tilde{q}_{i,a}$ for $a = 0$ and $a = \hat{a} := 0.05$, $i = 1, 2$.
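For reference, the potential and the resulting Euler scheme coefficients can be written down directly. The sketch below assumes that the polynomial in (5.80) reads $\frac{1}{200}(0.5x^6 - 15x^4 + 119x^2 + 28x + 50)$ and that the diffusion coefficient in (5.76) is $\sqrt{2\epsilon}$; the function and constant names are hypothetical.

```python
def potential(x):
    """Three-well potential (5.80), used on D = (-3.5, 3.5)."""
    return (0.5 * x**6 - 15.0 * x**4 + 119.0 * x**2 + 28.0 * x + 50.0) / 200.0


def potential_grad(x):
    """Derivative of the potential (5.80)."""
    return (3.0 * x**5 - 60.0 * x**3 + 238.0 * x + 28.0) / 200.0


def drift(x):
    """Drift mu = -V' of the diffusion (5.76) in this potential."""
    return -potential_grad(x)


EPSILON = 0.5                   # temperature used in Section 5.9
SIGMA = (2.0 * EPSILON) ** 0.5  # diffusion coefficient sqrt(2 * epsilon)
```

With $\epsilon = 0.5$ the diffusion coefficient equals one, so the noise term in the Euler scheme (5.51) reduces to $\sqrt{h}\,\eta_{k+1}$.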
Figure 5.2: The committors and MGF as in the main text. Panels: (a) committors, (b) MGF.

Figure 5.3: Optimally tilted potentials as in (5.78) for the MGF and committors. Panels: (a) optimally
tilted potentials for committors, (b) optimally tilted potential for MGF. $F_{q_i}$ and $F_{q_{i,a}}$
are the functions $F$ as in Section 5.8 for the $i$th committor for $a = 0$ and $a = \hat{a}$ respectively. The
constant $C_0$ for the $i$th committor for $i \in \{1, 2\}$ was chosen so that the tilted potential is equal
to the original potential in point $a_i$, and for the MGF so that these potentials are equal in
both $a_1$ and $a_2$.
In our experiments we considered an Euler scheme $X$ with a time step $h = 0.01$ corresponding
to the above diffusion $Y$ starting at $x_0 = 0$. We focused on estimating $\mathrm{mgf}(x_0)$ for $p = \hat{p}$, $q_i(x_0)$
for i = 1,2, and prixo) for T = 10. For some M e N 2 , ay = -3.6, 0.2 = 3.6, d = and 

Pi = di + a - l)a, j e {1,..., M}, consider Gaussian functions 


fiix) = 


1 

71 


exp{- 


jx-pif 




(5.81) 


In our experiments we used a linear parametrization of IS drifts as in Section 5.5. For each
estimation problem we used as the IS basis functions the above Gaussian functions for $M = 10$.
For estimating $p_{T,a}(x_0)$, considering a time-extended Euler scheme as in Remark 62 corresponding
to the above $X$, we additionally performed experiments using $2M$ time-dependent
IS basis functions

$$\hat{r}_i(x, t) = \tilde{r}_i(x), \quad \hat{r}_{M+i}(x, t) = t^p\,\tilde{r}_i(x), \quad i = 1, \ldots, M, \tag{5.82}$$

for different $p \in \mathbb{N}_+$, and for $M = 5$ and $M = 10$. See Section 7.1 and Chapter 10 for further
details on our numerical experiments. Note that since the above $\tilde{r}_i$ are continuous and $D$ is
bounded, from remarks 64, 65, and Theorem 63, in which one can take $v = 1$ and $\delta_1 = \sqrt{2\epsilon}$, it
follows that Condition 53 holds when estimating the MGF and committors as above. Using
further the fact that $E_\mathbb{U}(\tilde{\tau}) < \infty$ (which follows e.g. from Lemma 7.4 in [31]), (5.72) holds.
Furthermore, from Remark 66, we have (5.71) for the MGF and translated committors, and
(5.75) for the exit probability.
6 Some properties of the minimized functions and their estimators

In this chapter we discuss various properties of the functions and their estimators from
Chapter 4 for some parametrizations of IS from the previous chapter. These properties will
be useful when proving the convergence and asymptotic properties of certain minimization
methods of such estimators further on.


6.1 Cross-entropy and its estimators in the ECM setting 

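Let us consider the ECM setting as in Section 5.1. We have

$$\widehat{\mathrm{ce}}_n(b', b) = \overline{(ZL'(\Psi(b) - b^T X))}_n = \Psi(b)\overline{(ZL')}_n - b^T\overline{(ZL'X)}_n. \tag{6.1}$$

Let us assume Condition 36. Then,

$$\nabla_b\widehat{\mathrm{ce}}_n(b', b) = \nabla\Psi(b)\overline{(ZL')}_n - \overline{(ZL'X)}_n \tag{6.2}$$

and

$$\nabla_b^2\widehat{\mathrm{ce}}_n(b', b) = \nabla^2\Psi(b)\overline{(ZL')}_n. \tag{6.3}$$

Let us further assume Condition 35, so that $\nabla^2\Psi$ is positive definite. Then, from (6.3), $b \to
\widehat{\mathrm{ce}}_n(b', b)(\omega)$ has a positive definite Hessian and thus it is strictly convex only for $\omega \in \Omega_1^n$ and
$b' \in A$ such that

$$\overline{(ZL')}_n(\omega) > 0. \tag{6.4}$$

Furthermore, $b^* \in A$ is the unique minimum point of $b \to \widehat{\mathrm{ce}}_n(b', b)(\omega)$ only if (6.4) holds and

$$\nabla_b\widehat{\mathrm{ce}}_n(b', b^*)(\omega) = 0 \tag{6.5}$$

(where by $\nabla_b\widehat{\mathrm{ce}}_n(b', b^*)(\omega)$ we mean $\nabla_b(\widehat{\mathrm{ce}}_n(b', b)(\omega))_{b = b^*}$). Assuming (6.4), from (6.2), (6.5)
holds only if

$$\mu(b^*) = \nabla\Psi(b^*) = \frac{\overline{(ZL'X)}_n}{\overline{(ZL')}_n}(\omega), \tag{6.6}$$

or from Theorem 40 only if $\frac{\overline{(ZL'X)}_n}{\overline{(ZL')}_n}(\omega) \in \mu[A]$ and

$$b^* = \mu^{-1}\Big(\frac{\overline{(ZL'X)}_n}{\overline{(ZL')}_n}(\omega)\Big). \tag{6.7}$$

Let us assume that

$$E_{\mathbb{Q}_1}(|ZX_i|) < \infty, \quad i = 1, \ldots, l. \tag{6.8}$$

Due to $X$ having all mixed moments finite, from Hölder's inequality, (6.8) holds e.g. when
$E_{\mathbb{Q}_1}(|Z|^p) < \infty$ for some $p > 1$. For the cross-entropy we then have

$$\mathrm{ce}(b) = \alpha\Psi(b) - b^T E_{\mathbb{Q}_1}(ZX), \quad b \in A. \tag{6.9}$$

Thus, analogously as for the cross-entropy estimator above, $\mathrm{ce}$ has a positive definite Hessian
everywhere only if $\alpha > 0$, and $\mathrm{ce}$ has a unique minimum point only if $\alpha > 0$ and

$$\frac{E_{\mathbb{Q}_1}(ZX)}{\alpha} \in \mu[A], \tag{6.10}$$

in which case such a point is

$$b^* = \mu^{-1}\Big(\frac{E_{\mathbb{Q}_1}(ZX)}{\alpha}\Big). \tag{6.11}$$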

Remark 67. Note that we can receive analogous conditions as above for the cross-entropy and
its estimator to have negative definite Hessians or have unique maximum points by replacing $Z$
by $-Z$ (and thus also $\alpha$ by $-\alpha$) in the above conditions. The formulas for the maximum points
remain the same as for the minimum points above. With some exceptions, in the further sections
we shall focus on the minimization of cross-entropy and its estimators and will be interested in
checking the conditions as in the main text above. However, we can analogously perform their
maximization, or joint optimization if we consider alternatives of the above conditions.


6.2 Some conditions in the LETS setting 

Let $\|\cdot\|_\infty$ denote the supremum norm induced by the standard Euclidean norm $|\cdot|$. Consider
the LETS setting as in Section 5.3.4. For each real matrix-valued process $Y = (Y_k)_{k \in \mathbb{N}}$ on $\mathcal{S}_1$ and
$B \in \mathcal{F}_1$ let us define

$$\|Y\|_{\tau,B,\infty} = \operatorname*{ess\,sup}_{\mathbb{U}}\big(\mathbb{1}(B \cap \{0 < \tau < \infty\})\max(\|Y_0\|_\infty, \ldots, \|Y_{\tau-1}\|_\infty)\big), \tag{6.12}$$

which for $B = \Omega_1$ is denoted simply as $\|Y\|_{\tau,\infty}$. Let $S$ be an $\mathbb{R}$-valued random variable on $\mathcal{S}_1$.
Further on in this work we will often assume the following conditions.

Condition 68. It holds 

:= l|A||r,s^0,oo<oo. (6.13) 
Condition 69. A number $s \in \mathbb{N}_+$ is such that

$$\mathbb{1}(S \neq 0)\tau \le s. \tag{6.14}$$

Note that conditions 68 or 69 hold for each possible random variable $S$ as above only if they
hold for some $S$ such that $S(\omega) \neq 0$, $\omega \in \Omega_1$, that is only if

$$\|\Lambda\|_{\tau,\infty} < \infty \tag{6.15}$$

for Condition 68, or

$$\tau \le s \in \mathbb{N}_+ \tag{6.16}$$

for Condition 69.

Remark 70. Note that Condition 69 implies Condition 52 for $Z = S$, while (6.16) implies
Condition 53.

For each real matrix-valued function $f$ on $\mathbb{R}^m$ and $B\subset\mathbb{R}^m$, let us denote $\|f\|_{B,\infty} = \sup_{x\in B}\|f(x)\|_\infty$. If $\tau$ is the exit time of an Euler scheme $X$ of a set $D$ such that $X_0\in D$, then for $\Lambda$ as in (5.60) we have $\|\Lambda\|_{\tau,\infty} \le \|\Theta\|_{D,\infty}$. In particular, if 

$$\|\Theta\|_{D,\infty} < \infty, \tag{6.17}$$

then we have (6.15). Note that from (5.59), (6.17) is equivalent to $\|\tilde r_i\|_{D,\infty} < \infty$, $i = 1,\ldots,l$. In particular, (6.17) and thus also (6.15) hold in our numerical experiments as discussed in Section 5.9, both when using the time-independent and time-dependent IS basis functions, where in the time-dependent case by $\tilde r_i$ we mean $\hat r_i$ as in Section 5.9 and we consider $D$ equal to $D'$ as in (5.74), $X$ equal to $X'$ as in Remark 62, and $\tau$ equal to $\tau'$ as in (5.73). 

Let us discuss how one can enforce (6.16) if it is initially not fulfilled, as is the case for the translated committors and the MGF in our numerical experiments. Analogous reasoning as below can be applied also to more general stopped sequences or processes than in the LETS setting. For some $s\in\mathbb{N}_+$ and $z_s\in\mathbb{R}$, instead of $\tau$ and $Z$ we can consider their terminated versions $\tau_s = \tau\wedge s$ and $Z_s = \mathbb{1}(\tau\le s)Z + z_s\mathbb{1}(\tau>s)$ and focus on computing $\alpha_s = \mathbb{E}_{\mathbb{U}}(Z_s) = \mathbb{E}_{\mathbb{U}}(\mathbb{1}(\tau\le s)Z) + z_s\mathbb{U}(\tau>s)$ rather than $\alpha = \mathbb{E}_{\mathbb{U}}(Z)$. If $\mathbb{U}(\tau=\infty) = 0$, or $\mathbb{U}(Z\neq0,\ \tau=\infty) = 0$ and $\lim_{s\to\infty}z_s = 0$, then $\mathbb{U}$ a.s. $Z_s\to Z$, so that assuming further that $\limsup_{s\to\infty}|z_s| < \infty$, from $|Z_s| \le |Z| + |z_s|$ and Lebesgue's dominated convergence theorem, $\lim_{s\to\infty}\alpha_s = \alpha$. Thus, in such a case, for a sufficiently large $s$ the absolute error of approximating $\alpha$ by $\alpha_s$ is arbitrarily small. Let us provide some upper bounds on this error. If $\operatorname{ess\,sup}_{\mathbb{U}}(|Z - Z_s|\mathbb{1}(\tau>s)) \le M_s\in[0,\infty)$, then 

$$|\alpha-\alpha_s| = |\mathbb{E}_{\mathbb{U}}((Z-Z_s)\mathbb{1}(\tau>s))| \le M_s\mathbb{U}(\tau>s). \tag{6.18}$$

For the MGF example from Section 5.7 we can take $z_s = M_s = \frac12\exp(-ph(s+1))$, while for the translated committors we can choose $z_s = a+\frac12$ and $M_s = \frac12$. The quantity $\mathbb{U}(\tau>s)$ can be estimated using IS from the same simulations as used to estimate $\alpha_s$ or in a separate IS MC procedure. Alternatively, if we have $\tau\le\tilde\tau$ for some random variable $\tilde\tau$ with a known distribution, we can use the inequality $\mathbb{U}(\tau>s) \le \mathbb{U}(\tilde\tau>s)$ to bound the right side of (6.18) from above. For instance, if $\tilde\tau$ has a geometric distribution with a parameter $q$ (see Theorem 63 for a situation in which this may occur), then we have $\mathbb{U}(\tilde\tau>s) = (1-q)^s$ and thus $|\alpha-\alpha_s| \le M_s(1-q)^s$. 
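The geometric bound $|\alpha-\alpha_s| \le M_s(1-q)^s$ can be used to pick a termination horizon for a prescribed tolerance. The following is a minimal sketch under the assumption that the dominating variable is geometric on $\mathbb{N}_+$ with parameter $q$; the helper names `truncation_error_bound` and `smallest_horizon` are hypothetical.

```python
def truncation_error_bound(m_s, q, s):
    """Upper bound M_s * (1 - q)**s on |alpha - alpha_s| from (6.18)."""
    if not 0.0 < q <= 1.0:
        raise ValueError("q must lie in (0, 1]")
    return m_s * (1.0 - q) ** s


def smallest_horizon(m_of_s, q, tol, s_max=10**6):
    """Smallest s in {1, ..., s_max} whose bound does not exceed tol.

    m_of_s: callable returning M_s for a given s.
    """
    for s in range(1, s_max + 1):
        if truncation_error_bound(m_of_s(s), q, s) <= tol:
            return s
    raise ValueError("no horizon up to s_max meets the tolerance")


# Translated committor case: M_s = 1/2 for every s.
s = smallest_horizon(lambda s: 0.5, q=0.5, tol=0.01)
```

Since the bound only uses an upper estimate of $\mathbb{U}(\tau>s)$, the horizon found this way may be larger than necessary.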

6.3 Some conditions in the LETGS setting 

Let us discuss some conditions and random conditions in the LETGS setting, which, as we 
shall discuss in the further sections, turn out to be necessary for the existence of the unique 
minimum points of cross-entropy, mean square, and their estimators in this setting. Let Z be 
an IR-valued -measurable random variable (where S^\ - 

Definition 71. For $b\in A = \mathbb{R}^l$, we define a random condition $A_b$ on $\mathcal{S}_1$ as follows: 

$$A_b = (Z\neq0,\ 0<\tau<\infty,\ \text{and there exists } k\in\mathbb{N},\ k<\tau, \text{ such that } \lambda_k(b)\neq0). \tag{6.19}$$

Lemma 72. If $A_b$ does not hold and $Z\neq0$, then for each $a\in\mathbb{R}^l$ and $t\in\mathbb{R}$ 

$$L(a+tb) = L(a). \tag{6.20}$$

Proof. From (5.39), when $\tau = 0$ both sides of (6.20) are equal to 1, and when $\tau = \infty$ they are equal to $\epsilon$. If $A_b$ does not hold, $Z\neq0$, and $0<\tau<\infty$, then for each $0\le k<\tau$ we have $\lambda_k(b) = 0$, and thus for each $a\in\mathbb{R}^l$ and $t\in\mathbb{R}$, $\lambda_k(a+tb) = \lambda_k(a) + t\lambda_k(b) = \lambda_k(a)$, so that (6.20) also follows from (5.39). □ 

Lemma 73. For $n\in\mathbb{N}_+$, the following random conditions on $\mathcal{S}_1^n$ are equivalent. 

1. For each $b\in\mathbb{R}^l$, $b\neq0$, there exists $i\in\{1,\ldots,n\}$ such that $(A_b)_i$ holds (where we use the notation as in (4.16)). 

2. For some (equivalently, for each) random variable $K$ on $\mathcal{S}_1$ which is positive on $Z\neq0$, $\overline{(\mathbb{1}(Z\neq0)GK)}_n$ is positive definite. 

3. It holds $N := \sum_{i=1}^n\mathbb{1}(Z_i\neq0,\ \tau_i<\infty)\tau_i > 0$. Let a matrix $B\in\mathbb{R}^{(dN)\times l}$ be such that for each $i\in\{1,\ldots,n\}$ such that $0<\tau_i<\infty$ and $Z_i\neq0$, for each $k\in\{0,\ldots,\tau_i-1\}$ and $j\in\{1,\ldots,d\}$, the $\bigl(d\sum_{p=1}^{i-1}\mathbb{1}(Z_p\neq0,\ \tau_p<\infty)\tau_p + kd + j\bigr)$th row of $B$ is equal to the $j$th row of $(\Lambda_k)_i$. Then, the columns of $B$ are linearly independent. 

Proof. The fact that the second point above is a random condition follows from Sylvester's criterion. The equivalence of the first two conditions follows from the fact that for each $b\in\mathbb{R}^l$ 

$$b^T\overline{(\mathbb{1}(Z\neq0)G)}_nb = \frac{1}{2n}\sum_{i=1}^n\Bigl(\mathbb{1}(0<\tau<\infty,\ Z\neq0)\sum_{k=0}^{\tau-1}|\lambda_k(b)|^2\Bigr)_i, \tag{6.21}$$

and the equivalence of the first and last condition is obvious. □ 
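The third point of Lemma 73 suggests a direct numerical check: stack the rows of the matrices $\Lambda_k$ over all sampled paths with $Z\neq0$ and finite positive $\tau$, and test whether the stacked matrix has rank $l$. The sketch below is an illustration only; the data layout (a list of pairs `(z, lambdas)`, with `lambdas` set to `None` when $\tau=\infty$) and the names `rank` and `r_n_holds` are assumptions, not notation from the text.

```python
def rank(rows, tol=1e-12):
    """Rank of a matrix given as a list of rows (Gaussian elimination)."""
    m = [list(map(float, r)) for r in rows]
    if not m:
        return 0
    rk = 0
    for c in range(len(m[0])):
        if rk == len(m):
            break
        # Partial pivoting: largest entry of column c among unused rows.
        piv = max(range(rk, len(m)), key=lambda r: abs(m[r][c]))
        if abs(m[piv][c]) <= tol:
            continue
        m[rk], m[piv] = m[piv], m[rk]
        for r in range(rk + 1, len(m)):
            f = m[r][c] / m[rk][c]
            for cc in range(c, len(m[0])):
                m[r][cc] -= f * m[rk][cc]
        rk += 1
    return rk


def r_n_holds(samples, l):
    """Check the third point of Lemma 73 on sampled paths.

    samples: list of (z, lambdas); lambdas is the list of d x l matrices
             Lambda_0, ..., Lambda_{tau-1}, or None when tau is infinite.
    """
    rows = [row
            for z, lambdas in samples
            if z != 0 and lambdas          # keep Z != 0 and 0 < tau < infinity
            for lam in lambdas
            for row in lam]
    return rank(rows) == l


# Two paths with d = 1 and l = 2: alone, the first one only excites direction e_1.
path_1 = (1.0, [[[1.0, 0.0]], [[1.0, 0.0]]])
path_2 = (2.0, [[[0.0, 1.0]]])
ok = r_n_holds([path_1, path_2], l=2)
```

A rank test with a fixed tolerance is a pragmatic choice here; for badly scaled $\Lambda_k$ a singular value decomposition would be more robust.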
Definition 74. We define $r_n$ to be one of the equivalent random conditions in Lemma 73. 

Lemma 75. The below three conditions are equivalent. 

1. For each $b\in\mathbb{R}^l$, $b\neq0$, we have $\mathbb{Q}_1(A_b) > 0$. 

2. For each $b\in\mathbb{R}^l$, from $\mathbb{Q}_1(A_b) = 0$ it follows that $b = 0$. 

3. Let $\Lambda_j = ((\Lambda_{k,i,j})_{i=1}^{d})_{k\in\mathbb{N}}\in\mathcal{J}$, $j = 1,\ldots,l$ (see Definition 42). Let $\sim$ be a relation of equivalence on $\mathcal{J}$ such that for $\psi_1,\psi_2\in\mathcal{J}$, $\psi_1\sim\psi_2$ only if $\mathbb{Q}_1$ a.s. if $0<\tau<\infty$ and $Z\neq0$ then $\psi_{1,i} = \psi_{2,i}$, $i = 0,\ldots,\tau-1$. Then, the equivalence classes $[\Lambda_1]_\sim,\ldots,[\Lambda_l]_\sim$ are linearly independent in the linear space $\mathcal{J}/\!\sim$ of equivalence classes of $\sim$, defined in a standard way (i.e. the operations in such a linear space are defined by using in them in the place of the equivalence classes their arbitrary members and then taking the equivalence class of the result). 

Proof. The equivalence of the first two conditions is obvious. The equivalence of the last two conditions follows from the fact that, using notations as in the third condition, for $b\in\mathbb{R}^l$, $\sum_{i=1}^lb_i[\Lambda_i]_\sim$ is equal to the zero in $\mathcal{J}/\!\sim$ only if $\mathbb{Q}_1(A_b) = 0$. □ 

Condition 76. We define the condition under consideration to be one of the conditions from 
Lemma 75. 

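Remark 77. Note that for a probability $\mathbb{S}\sim_{\tau<\infty}\mathbb{Q}_1$ we have $\mathbb{S}(A_b) > 0$ only if $\mathbb{Q}_1(A_b) > 0$, so that Condition 76 holds only if it holds for such an $\mathbb{S}$ in the place of $\mathbb{Q}_1$. 

Remark 78. Note that $\mathbb{Q}_1(A_b) > 0$ only if for some $l\in\mathbb{N}_+$ and $k\in\mathbb{N}$, $k<l$, we have $\mathbb{Q}_1(Z\neq0,\ \tau = l,\ \lambda_k(b)\neq0) > 0$. 

Lemma 79. Let for some probability $\mathbb{S}\sim_{\tau<\infty}\mathbb{Q}_1$, a random variable $K$ on $\mathcal{S}_1$ be $\mathbb{S}$ a.s. positive on $Z\neq0$, and let $\mathbb{1}(Z\neq0)KG$ have $\mathbb{S}$-integrable entries. Then, $\mathbb{E}_{\mathbb{S}}(\mathbb{1}(Z\neq0)KG)$ is positive definite only if Condition 76 holds. 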

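Proof. For each $b\in\mathbb{R}^l$, $b\neq0$, 

$$b^T\mathbb{E}_{\mathbb{S}}(\mathbb{1}(Z\neq0)KG)b = \frac12\mathbb{E}_{\mathbb{S}}\Bigl(\mathbb{1}(Z\neq0,\ 0<\tau<\infty)K\sum_{k=0}^{\tau-1}|\lambda_k(b)|^2\Bigr) \tag{6.22}$$

is greater than zero only if $\mathbb{S}(A_b) > 0$, so that from Remark 77 we obtain the thesis. □ 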

Let $\operatorname{Sym}_n(\mathbb{R})$ denote the subset of $\mathbb{R}^{n\times n}$ consisting of symmetric matrices, and let $m_n : \operatorname{Sym}_n(\mathbb{R})\to\mathbb{R}$ be such that for $A\in\operatorname{Sym}_n(\mathbb{R})$, $m_n(A)$ is equal to the lowest eigenvalue of $A$, or equivalently 

$$m_n(A) = \inf_{x\in\mathbb{R}^n,\ |x|=1}x^TAx. \tag{6.23}$$

Lemma 80. $m_n$ is Lipschitz from $(\operatorname{Sym}_n(\mathbb{R}), \|\cdot\|_\infty)$ to $(\mathbb{R}, |\cdot|)$ with a Lipschitz constant 1. 

Proof. For $A, B\in\operatorname{Sym}_n(\mathbb{R})$ and $x\in\mathbb{R}^n$, $|x| = 1$, we have $x^TAx = x^TBx + x^T(A-B)x$, so that 

$$x^TBx - \|A-B\|_\infty \le x^TAx \le x^TBx + \|A-B\|_\infty \tag{6.24}$$

and thus 

$$m_n(B) - \|A-B\|_\infty \le m_n(A) \le m_n(B) + \|A-B\|_\infty \tag{6.25}$$

and 

$$|m_n(B) - m_n(A)| \le \|A-B\|_\infty. \tag{6.26}$$

□ 
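Lemma 80 can be checked numerically for $2\times2$ symmetric matrices, where the eigenvalues are available in closed form. The sketch below assumes that $\|\cdot\|_\infty$ is the operator norm induced by the Euclidean norm, which for a symmetric matrix equals the largest absolute eigenvalue; the function names are hypothetical.

```python
import math
import random


def eigs_sym2(a, b, c):
    """Eigenvalues (lowest, highest) of the symmetric matrix [[a, b], [b, c]]."""
    mean = (a + c) / 2.0
    radius = math.hypot((a - c) / 2.0, b)
    return mean - radius, mean + radius


def m2(a, b, c):
    """Lowest eigenvalue, i.e. m_n from (6.23) for n = 2."""
    return eigs_sym2(a, b, c)[0]


def op_norm_sym2(a, b, c):
    """Operator norm of a symmetric 2 x 2 matrix: largest absolute eigenvalue."""
    lo, hi = eigs_sym2(a, b, c)
    return max(abs(lo), abs(hi))


# Random check of |m(B) - m(A)| <= ||A - B|| as in (6.26).
random.seed(0)
violations = 0
for _ in range(1000):
    A = [random.uniform(-5.0, 5.0) for _ in range(3)]
    B = [random.uniform(-5.0, 5.0) for _ in range(3)]
    diff = [p - q for p, q in zip(A, B)]
    if abs(m2(*A) - m2(*B)) > op_norm_sym2(*diff) + 1e-12:
        violations += 1
```

The small additive slack only guards against floating-point rounding.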

Lemma 81. If the entries of some matrices $M_n\in\operatorname{Sym}_l(\mathbb{R})$, $n\in\mathbb{N}_+$, converge to the respective entries of a positive definite symmetric matrix $M\in\mathbb{R}^{l\times l}$, then for a sufficiently large $n$, $M_n$ is positive definite. 

Proof. This follows from the fact that $A\in\operatorname{Sym}_l(\mathbb{R})$ is positive definite only if $m_l(A) > 0$, and from Lemma 80, $\lim_{n\to\infty}m_l(M_n) = m_l(M)$. □ 

Theorem 82. If Condition 76 holds, then under Condition 32, a.s. for a sufficiently large $n$, $r_n(\kappa_n)$ holds for $r_n$ as in Definition 74. In particular, a.s. $\lim_{n\to\infty}\mathbb{P}(r_n(\kappa_n)) = 1$. 

Proof. Let $K = \exp(-\max_{i,j=1,\ldots,l}|G_{i,j}|)$. Then, $K > 0$ and the entries of the matrix $\mathbb{1}(Z\neq0)GK$ are bounded and thus $\mathbb{Q}'$-integrable. Thus, from Lemma 79 for $\mathbb{S} = \mathbb{Q}'$, $\mathbb{E}_{\mathbb{Q}'}(\mathbb{1}(Z\neq0)KG)$ is positive definite. Let $A_n = \overline{(\mathbb{1}(Z\neq0)KG)}_n(\kappa_n)$. From the SLLN, a.s. 

$$\lim_{n\to\infty}A_n = \mathbb{E}_{\mathbb{Q}'}(\mathbb{1}(Z\neq0)KG). \tag{6.27}$$

Thus, from Lemma 81, a.s. $A_n$ is positive definite for a sufficiently large $n$ and the thesis follows from the second point of Lemma 73. □ 

6.4 Discussion of Condition 76 in the Euler scheme case 

Let us consider IS for an Euler scheme with a linear parametrization of IS drifts, discussed 
in Section 5.5 below formula (5.57). In this section we shall reformulate Condition 76 and 
provide some sufficient assumptions for it to hold in such a case. 

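Let us define a measure $\nu$ on $\mathcal{B}(\mathbb{R}^m)$ to be such that for each $B\in\mathcal{B}(\mathbb{R}^m)$ 

$$\begin{aligned}
\nu(B) &= \mathbb{E}_{\mathbb{U}}\Bigl(\mathbb{1}(Z\neq0,\ 0<\tau<\infty)\sum_{k=0}^{\tau-1}\mathbb{1}(X_k\in B)\Bigr)\\
&= \sum_{l\in\mathbb{N}_+}\sum_{k=0}^{l-1}\mathbb{U}(Z\neq0,\ \tau=l,\ X_k\in B)\\
&= \sum_{k=0}^{\infty}\mathbb{U}(Z\neq0,\ k<\tau<\infty,\ X_k\in B).
\end{aligned} \tag{6.28}$$

Remark 83. From the second line of (6.28), Remark 78, and (5.60), $\mathbb{Q}_1(A_b) = \mathbb{U}(A_b) = 0$ is equivalent to 

$$\nu(\{\Theta b\neq0\}) = 0 \tag{6.29}$$

(where $\{\Theta b\neq0\} = \{x\in\mathbb{R}^m : \Theta(x)b\neq0\}$). 

Remark 84. Let for each $i\in\{1,\ldots,l\}$, $\Theta_i : \mathbb{R}^m\to\mathbb{R}^d$ be the $i$th column of $\Theta$ and $[\Theta_i]_\approx$ be the class of equivalence of $\Theta_i$ with respect to the relation $\approx$ of equality $\nu$ a.e. on the set $\mathcal{F}$ of measurable functions from $\mathcal{S}(\mathbb{R}^m)$ to $\mathcal{S}(\mathbb{R}^d)$. Then, from Remark 83, Condition 76 is equivalent to $[\Theta_i]_\approx$, $i = 1,\ldots,l$, being linearly independent in the linear space $\mathcal{F}/\!\approx$ defined in a standard way. 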

Let us assume that $m = n+1$ for some $n\in\mathbb{N}_+$. Consider the following condition concerning the IS basis functions $\tilde r_i : \mathbb{R}^m\to\mathbb{R}^d$, $i = 1,\ldots,l$, as in Section 5.5. 

Condition 85. For some $m_1, m_2\in\mathbb{N}_+$, functions $g_{1,i} : \mathbb{R}\to\mathbb{R}$, $i = 1,\ldots,m_1$, and $g_{2,i} : \mathbb{R}^n\to\mathbb{R}^d$, $i = 1,\ldots,m_2$, are such that for $K_1 = \{kh : k\in\{1,\ldots,m_1\}\}$, $g_{1,i}|_{K_1}$, $i = 1,\ldots,m_1$, are linearly independent and for each $i\in\{1,\ldots,m_1\}$, for some open set $K_{2,i}\subset\mathbb{R}^n$, $g_{2,j}|_{K_{2,i}}$, $j = 1,\ldots,m_2$, are continuous and linearly independent. Furthermore, we have $l = m_1m_2$, and denoting $\pi(i,j) = m_2(i-1)+j$, for each $x\in\mathbb{R}^n$ and $t\in\mathbb{R}$ we have $\tilde r_{\pi(i,j)}(x,t) = g_{1,i}(t)g_{2,j}(x)$, $i = 1,\ldots,m_1$, $j = 1,\ldots,m_2$. 

Remark 86. As the functions $g_{1,i}$ as in the above condition one can take for example polynomials $g_{1,i}(t) = t^{i-1}$, $i = 1,\ldots,m_1$. For $m_1 = 2$ one can also use $g_{1,1}(t) = 1$ and $g_{1,2}(t) = t^p$ for some $p\in\mathbb{N}_+$. For $n = 1$ and arbitrary nonempty open sets $K_{2,i}\subset\mathbb{R}$, $i = 1,\ldots,m_2$, as the functions $g_{2,i}$ in the above condition one can take e.g. polynomials analogously as above or Gaussian functions $g_{2,i}(x) = a_i\exp(\ldots)$ for some $a_i\in\mathbb{R}\setminus\{0\}$, $s\in\mathbb{R}_+$, and $p_i\in\mathbb{R}$ different for different $i$ (the linear independence of such Gaussian functions on each open interval can be proved by an analogous reasoning as in [1]). In particular, for such $K_{2,i}$, Condition 85 holds for the functions $\tilde r_i$ equal to $\hat r_i$ as in (5.82) or equal to $\tilde r_i$ such that $\tilde r_i(x,t) = r_i(x)$, $x\in\mathbb{R}^n$, $t\in\mathbb{R}$, for $r_i$ as in (5.81), where in the first case $m_1 = 2$, in the second $m_1 = 1$, and in both cases $n = 1$ and $m_2$ is equal to $M$ as in Section 5.9. 

Let $\lambda$ denote the Lebesgue measure on $\mathbb{R}^n$ and $\delta_x$ the Dirac measure centred on $x$. 

Theorem 87. If Condition 85 holds and 

$$\lambda\otimes\delta_{ih} \ll_{K_{2,i}\times\{ih\}} \nu, \quad i = 1,\ldots,m_1, \tag{6.30}$$

then Condition 76 holds. 

Proof. Let $b\in\mathbb{R}^l$ be such that $\mathbb{Q}_1(A_b) = 0$. Then, from Remark 83, $\nu(\{\Theta b\neq0\}) = 0$ and thus for $i = 1,\ldots,m_1$, $\nu(\{(x,ih) : x\in K_{2,i},\ \Theta(x,ih)b\neq0\}) = 0$ and from (6.30) 

$$\lambda(\{x\in K_{2,i} : \Theta(x,ih)b\neq0\}) = 0. \tag{6.31}$$

From (5.59) and Condition 85 we have for $x\in\mathbb{R}^n$ and $t\in\mathbb{R}$ 

$$\Theta(x,t)b = \sqrt{h}\sum_{j=1}^{m_1}\sum_{k=1}^{m_2}b_{\pi(j,k)}g_{1,j}(t)g_{2,k}(x). \tag{6.32}$$

Denoting for $i = 1,\ldots,m_1$ and $k = 1,\ldots,m_2$ 

$$a_{i,k} = \sum_{j=1}^{m_1}b_{\pi(j,k)}g_{1,j}(ih), \tag{6.33}$$

we thus have $\Theta(x,ih)b = \sqrt{h}\sum_{k=1}^{m_2}a_{i,k}g_{2,k}(x)$, $x\in\mathbb{R}^n$, and from (6.31), 

$$\lambda\Bigl(\Bigl\{x\in K_{2,i} : \sum_{k=1}^{m_2}a_{i,k}g_{2,k}(x)\neq0\Bigr\}\Bigr) = 0. \tag{6.34}$$

Thus, for $i = 1,\ldots,m_1$, from the continuity and linear independence of $g_{2,k}|_{K_{2,i}}$, $k = 1,\ldots,m_2$, we have $a_{i,k} = 0$, $k = 1,\ldots,m_2$. Therefore, from (6.33), for $k = 1,\ldots,m_2$, $\sum_{j=1}^{m_1}b_{\pi(j,k)}g_{1,j}|_{K_1} = 0$, so that from the linear independence of $g_{1,j}|_{K_1}$, $j = 1,\ldots,m_1$, we have $b = 0$. □ 


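Let us assume the following condition. 

Condition 88. We have $\sigma_{m,i} = 0$, $i = 1,\ldots,d$, $\mu_m = 1$, $(x_0)_m = 0$, and $\tilde\sigma : \mathbb{R}^m\to\mathbb{R}^{n\times d}$ is such that $\tilde\sigma_{i,j} = \sigma_{i,j}$, $i = 1,\ldots,n$, $j = 1,\ldots,d$. 

Note that it now holds for $\tilde X_k = (X_{k,i})_{i=1}^n$, $k\in\mathbb{N}$, that 

$$X_k = (\tilde X_k, kh), \quad k\in\mathbb{N}. \tag{6.35}$$

For $x\in\mathbb{R}^n$ and $k\in\mathbb{N}$ for which $\tilde\sigma(x,kh)$ has linearly independent rows, let $Q_k(x) = (h\tilde\sigma(x,kh)\tilde\sigma(x,kh)^T)^{-1}$, and for $y\in\mathbb{R}^n$, let 

$$p_k(x,y) = \frac{\sqrt{\det(Q_k(x))}}{(2\pi)^{n/2}}\exp\Bigl(-\frac12(y-x-h\mu(x))^TQ_k(x)(y-x-h\mu(x))\Bigr). \tag{6.36}$$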


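Theorem 89. Let $k\in\mathbb{N}_+$ and sets $B_1, B_2,\ldots,B_k, C\in\mathcal{B}(\mathbb{R}^n)$ have positive Lebesgue measure. Let $\mathbb{U}$ a.s. the fact that $\tilde X_i\in B_i$, $i = 1,\ldots,k$, and $\tilde X_{k+1}\in C$ imply that $Z\neq0$ and $k<\tau<\infty$. Let further $\tilde\sigma(x,t)$ have independent rows for each $(x,t)\in\{x_0\}\cup\bigcup_{i=1}^kB_i\times\{ih\}$. Then, 

$$\lambda\otimes\delta_{ih} \ll_{B_i\times\{ih\}} \nu, \quad i = 1,\ldots,k. \tag{6.37}$$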


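Proof. It follows from the fact that for each $j\in\{1,\ldots,k\}$, for each $D\in\mathcal{B}(B_j)$ such that $\lambda(D) > 0$, for $\tilde D = \prod_{i=1}^{j-1}B_i\times D\times\prod_{i=j+1}^{k}B_i\times C$ we have 

$$\begin{aligned}
0 &< \int_{\tilde D}\prod_{i=0}^{k}p_i(x_i,x_{i+1})\,dx_1\ldots dx_{k+1}\\
&= \mathbb{U}((\tilde X_i)_{i=1}^{k+1}\in\tilde D)\\
&= \mathbb{U}(Z\neq0,\ k<\tau<\infty,\ (\tilde X_i)_{i=1}^{k+1}\in\tilde D)\\
&\le \mathbb{U}(Z\neq0,\ j<\tau<\infty,\ \tilde X_j\in D)\\
&\le \nu(D\times\{jh\}),
\end{aligned} \tag{6.38}$$

where in the last line we used (6.35) and the last line of (6.28). □ 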


Remark 90. Let us consider the problems of estimating an MGF mgf(xo), a translated committor 
cjAB,aixo)> (ind a translated exit probability by a given time pr.aixo) as in Section 5.7 for a -i. 
As $x_0$, $\mu$, and $\sigma$ fulfilling the above Condition 88 let us consider $x_0'$, $\mu'$, and $\sigma'$ as in Remark 62, and as an Euler scheme $X$ in the LETGS setting as above let us consider the time-extended process $X'$ as in that remark. Note that the process $X$ as in Section 5.7 is now equal to the above $\tilde X$. Let $k\in\mathbb{N}_+$. Then for $D$ and $Z$ corresponding to the above expectations as in Section 5.7, assuming that $C\in\mathcal{B}(B)$ in the case of estimation of the translated committor, or $C\in\mathcal{B}(D')$ for the MGF or the exit probability, and additionally $T > h(k+1)$ in the case of the exit probability, for each $B\in\mathcal{B}(D)$ we have that $\tilde X_i\in B$, $i = 1,\ldots,k$, and $\tilde X_{k+1}\in C$ implies that $Z\neq0$ and $\tau = k+1$ (where for the exit probability rather than $\tau$ we mean $\tau'$ as in (5.73)). This holds also for $Z$ and $\tau$ replaced by their terminated versions $Z_s$ and $\tau_s$ for $s\in\mathbb{N}_+$, $s > k$, as in Section 6.2. 

From remarks 86 and 90 and theorems 87 and 89 it follows that Condition 76 holds in all the cases considered in our numerical experiments as in Section 5.9 if for the case of the exit probability before a given time we assume that $T > h(m_1+1)$ for $m_1$ depending on the basis functions used as in Remark 86. Furthermore, this condition also holds in such terminated cases as in Section 6.2 for each $s\in\mathbb{N}_+$, $s > m_1$. 


6.5 Cross-entropy and its estimators in the LETGS setting 

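Consider the LETGS setting. From (5.43), 

$$\begin{aligned}
\widehat{\operatorname{ce}}_n(b',b) &= \overline{(ZL'(b^TGb + H^Tb + \mathbb{1}(\tau=\infty)\ln(\epsilon)))}_n\\
&= b^T\overline{(ZL'G)}_nb + \overline{(ZL'H)}_n^Tb + \ln(\epsilon)\overline{(ZL'\mathbb{1}(\tau=\infty))}_n,
\end{aligned} \tag{6.39}$$

so that 

$$\nabla_b\widehat{\operatorname{ce}}_n(b',b) = 2\overline{(ZL'G)}_nb + \overline{(ZL'H)}_n. \tag{6.40}$$

Thus, $b\to\widehat{\operatorname{ce}}_n(b',b)(\omega)$ has a unique minimum point $b_n^*\in A$ only for $\omega\in\Omega_1^n$ for which $\overline{(ZL'G)}_n(\omega)$ is positive definite, in which case for $A_n(b') := 2\overline{(ZL'G)}_n$ and $B_n(b') := -\overline{(ZL'H)}_n$ we have 

$$b_n^* = (A_n(b'))^{-1}(\omega)B_n(b')(\omega). \tag{6.41}$$

Note that if $Z\ge0$ then from the second point of Lemma 73 for $K = ZL'$, for each $\omega\in\Omega_1^n$, $\overline{(ZL'G)}_n(\omega)$ is positive definite only if $r_n(\omega)$ holds (see Definition 74). 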
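Formula (6.41) amounts to solving one small linear system built from sample means. The following is a minimal sketch, assuming that each simulated path has already been reduced to a triple `(w, G, H)` with `w` standing for $ZL'$, `G` an $l\times l$ matrix and `H` a vector of length $l$; this data layout and the names `solve` and `ce_minimizer_letgs` are illustrative assumptions.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(map(float, a[i])) + [float(b[i])] for i in range(n)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        if abs(m[piv][c]) < 1e-14:
            raise ValueError("matrix is numerically singular")
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = m[i][n] - sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = s / m[i][i]
    return x


def ce_minimizer_letgs(samples):
    """Return the solution of A_n b = B_n with A_n = 2 mean(w G), B_n = -mean(w H).

    samples: list of (w, G, H) triples as described in the lead-in.
    """
    n = len(samples)
    l = len(samples[0][2])
    a_n = [[2.0 * sum(w * g[i][j] for w, g, _ in samples) / n
            for j in range(l)] for i in range(l)]
    b_n = [-sum(w * h[i] for w, _, h in samples) / n for i in range(l)]
    return solve(a_n, b_n)


samples = [
    (1.0, [[1.0, 0.0], [0.0, 1.0]], [-2.0, 0.0]),
    (1.0, [[1.0, 0.0], [0.0, 3.0]], [0.0, -4.0]),
]
b_star = ce_minimizer_letgs(samples)
```

The sketch does not verify positive definiteness of $A_n(b')$; the solution is a minimum point only when that condition holds, which in practice can be checked separately, e.g. by a Cholesky factorization.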

Condition 91. $ZG$ and $ZH$ have $\mathbb{Q}_1$-integrable (equivalently, $\mathbb{U}$-integrable) entries. 

Lemma 92. Let Condition 68 hold for $S = Z$, let for some $p > 1$, $\mathbb{E}_{\mathbb{U}}(|Z|^p) < \infty$, and let for some $s\in\mathbb{N}_+$ and a random variable $\tilde\tau$ with a geometric distribution under $\mathbb{U}$ with a parameter $q\in(0,1]$ it hold 

$$\mathbb{1}(Z\neq0)\tau \le \hat\tau := s+\tilde\tau. \tag{6.42}$$

Then, for each $1\le u<p$, we have $\mathbb{E}_{\mathbb{U}}(|ZH_i|^u) < \infty$ and $\mathbb{E}_{\mathbb{U}}(|ZG_{i,j}|^u) < \infty$, $i,j\in\{1,\ldots,l\}$. In particular, Condition 91 holds. 

Proof. Let I < u< p. For r £ (m,oo) such that “ + | = 1, using Holder’s inequality and (6.42) 
we have 


EudZIlGlle 


<Eu((|Z|t-1?^ 


ii'\ 


< (-i?^)“(Eu(|Z|P))p(Eu(r))^ <oo 


and 


Eu((|Z||H||ool)“)<Eu((|Z|i?X lfifcl)“)^^“(Eu(ZP))^(Eu((E IqkWW- 

k=l k=l 


Furthermore, 

T OO I OO I 

eu((E \hk\V) = E Mif= /)(E ^hkW) < E i) E \hk\’') 

k=\ l=\ k=l 1=1 k=l 

and from Schwarz’s inequality, 

I 1 ' 1 

EuIKt = Z) E Ifi-tD ^ u(f = Z)HEu((E 

k=l k=l 


(6.43) 


(6.44) 


(6.45) 


(6.46) 


ItholdsU(T = s+A;) = q{l-q)’^ \ fce N+,andEu((LEi = lEuQqif''+l{l-l){Evi\qi\'')f ■ 

The thesis easily follows from the above formulas. □ 

Note that (6.42) in the above lemma holds e.g. for $s = 0$ and $\tilde\tau$ as in Theorem 63 if the assumptions of this theorem hold for $B = \{0\}$, or for $\tilde\tau = 0$ for $\tau$ being an arbitrary stopping time terminated at $s$ as in Section 6.2. 

Let us assume conditions 52 and 91. Then, from (5.43) we obtain the following formula for the cross-entropy 

$$\operatorname{ce}(b) = \mathbb{E}_{\mathbb{Q}_1}(Z\ln(L(b))) = b^T\mathbb{E}_{\mathbb{Q}_1}(ZG)b + \mathbb{E}_{\mathbb{Q}_1}(ZH)^Tb. \tag{6.47}$$

Let $A = 2\mathbb{E}_{\mathbb{Q}_1}(ZG) = \nabla^2\operatorname{ce}(b)$, $b\in\mathbb{R}^l$, and $B = -\mathbb{E}_{\mathbb{Q}_1}(ZH)$. Then, we have $\nabla\operatorname{ce}(b) = Ab - B$. Thus, if $\mathbb{E}_{\mathbb{Q}_1}(ZG)$ is positive definite, then $\operatorname{ce}$ has a unique minimum point $b^*\in A$, satisfying 

$$b^* = A^{-1}B. \tag{6.48}$$

If $Z\ge0$, then from Lemma 79 for $K = Z$, $\mathbb{E}_{\mathbb{Q}_1}(ZG)$ is positive definite only if Condition 76 holds. Remark 67 applies also to the above discussion in the LETGS setting. 

6.6 Some properties of expectations of random functions 

Some of the below theorems are modifications or slight extensions of well-known results; see 
the appendix of Chapter 1 in [53]. 

Let $l\in\mathbb{N}_+$ and $A\in\mathcal{B}(\mathbb{R}^l)$ be nonempty. A function $f : A\to\overline{\mathbb{R}}$ is said to be lower semicontinuous at a point $b\in A$ if $\liminf_{x\to b}f(x) \ge f(b)$, and it is said to be lower semicontinuous if it is lower semicontinuous at each $b\in A$. 

Lemma 93. A lower semicontinuous function $f : A\to\overline{\mathbb{R}}$ such that $f > -\infty$ (i.e. $f(b) > -\infty$, $b\in A$) attains a minimum on each nonempty compact set $K\subset A$ (where such a minimum may be equal to infinity). 

Proof. Let $m = \inf_{b\in K}f(b)$ and let $a_n\in K$, $n\in\mathbb{N}_+$, be such that $\lim_{n\to\infty}f(a_n) = m$. Consider a subsequence $(a_{n_k})_{k\in\mathbb{N}_+}$ of $(a_n)_{n\in\mathbb{N}_+}$, converging to some $b^*\in K$. Then, from the lower semicontinuity of $f$, $m = \liminf_{k\to\infty}f(a_{n_k}) \ge f(b^*)$, so that $f(b^*) = m$. □ 

Condition 94. A (random) function $h : \mathcal{S}(A)\otimes(\Omega,\mathcal{F})\to\mathcal{S}(\overline{\mathbb{R}})$ is such that a.s. $b\in A\to h(b) := h(b,\cdot)$ is lower semicontinuous and 

$$\mathbb{E}\Bigl(\sup_{b\in A}(h(b)_-)\Bigr) < \infty. \tag{6.49}$$

For such an $h$ we denote $b\in A\to f(b) := \mathbb{E}(h(b))$. 

Lemma 95. Assuming Condition 94, we have $f > -\infty$ and $f$ is lower semicontinuous on $A$. 

Proof. From (6.49), $f > -\infty$. For each $b\in A$ and $a_n\in A$, $n\in\mathbb{N}$, such that $\lim_{n\to\infty}a_n = b$, from Fatou's lemma (which can be used thanks to (6.49)) and the a.s. lower semicontinuity of $b\to h(b)$, 

$$\liminf_{n\to\infty}f(a_n) \ge \mathbb{E}\bigl(\liminf_{n\to\infty}h(a_n)\bigr) \ge f(b). \tag{6.50}$$

□ 

Let further in this section $A\subset\mathbb{R}^l$ be open. For $x\in A$, let $d_x = \inf_{y\in A'}|y-x|$. For a sequence $x_n\in A$, $n\in\mathbb{N}_+$, let us write $x_n\uparrow A$ if $\max(\frac{1}{d_{x_n}}, |x_n|)\to\infty$ as $n\to\infty$, i.e. $x_n$ in a sense tries to leave $A$. For $a\in\overline{\mathbb{R}}$ and $f : A\to\overline{\mathbb{R}}$, let us denote by $\lim_{x\uparrow A}f(x) = a$ the fact that $\lim_{n\to\infty}f(x_n) = a$ whenever $x_n\uparrow A$. 

Condition 96. A lower semicontinuous function $f : A\to\overline{\mathbb{R}}$ fulfills $f > -\infty$ and $\lim_{x\uparrow A}f(x) = \infty$. 

Condition 97. Condition 96 holds, the set $B$ on which $f$ is finite is nonempty and convex, and $f$ is strictly convex on $B$. 

Lemma 98. Under Condition 96, $f$ attains a minimum on $A$ and if Condition 97 holds, then the corresponding minimum point $b^*$ is unique and $f(b^*) < \infty$. 

Proof. Under Condition 96, for a sufficiently large $M > 0$, for a compact set 

$$K = \{b\in A : |b|\le M,\ d_b\ge\tfrac1M\}, \tag{6.51}$$

we have $\inf_{b\in A}f(b) = \inf_{b\in K}f(b)$. From Lemma 93 there exists a minimum point $b^*\in K$ of $f$ on $K$ and thus also on $A$. Under Condition 97 we have $f(b^*) < \infty$ and the uniqueness of $b^*$ follows from the strict convexity of $f$. □ 

Lemma 99. Assuming Condition 94, if with positive probability $\lim_{b\uparrow A}h(b) = \infty$, then $\lim_{b\uparrow A}f(b) = \infty$. 

Proof. For $a_k\uparrow A$, from Fatou's lemma 

$$\liminf_{k\to\infty}f(a_k) \ge \mathbb{E}\bigl(\liminf_{k\to\infty}h(a_k)\bigr) = \infty. \tag{6.52}$$

□ 

Lemma 100. Under Condition 94, let us assume that $A$ is convex, for some $b_0\in A$, $f(b_0) < \infty$, and a.s. $b\to h(b)$ is convex. Then, $f$ is convex on the convex nonempty set $B\subset A$ on which it is finite. If further with positive probability $b\to h(b)\in\mathbb{R}$ is strictly convex and $\lim_{b\uparrow A}h(b) = \infty$, then $f$ satisfies Condition 97. 

Proof. The (strict) convexity of $f$ and the convexity of $B$ easily follow from $f(b) = \mathbb{E}(h(b))$. The remaining points of Condition 97 follow from lemmas 95 and 99. □ 

6.7 Some properties of mean square and its estimators 

Let us consider the mean square function and its estimators as in sections 4.1 and 4.2 (under 
appropriate assumptions as in these sections). 

Condition 101. $A$ is convex and $b\in A\to L(b)(\omega)\in\mathbb{R}$ is convex and continuous, $\omega\in\Omega_1$. 

From (4.21), if Condition 101 holds, then $b\to\widehat{\operatorname{msq}}_n(b',b)$ is convex and continuous (for each $b'\in A$ and when evaluated on each $\omega\in\Omega_1^n$). 

Definition 102. For $A$ open and convex, let the random condition $p_{\operatorname{msq}}$ on $\mathcal{S}_1$ hold only if $Z\neq0$, $b\to L(b)\in\mathbb{R}_+$ is strictly convex, and $\lim_{b\uparrow A}L(b) = \infty$. 

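Remark 103. If Condition 101 holds and for some $n\in\mathbb{N}_+$, $\omega = (\omega_i)_{i=1}^n\in\Omega_1^n$, and $i\in\{1,\ldots,n\}$, $p_{\operatorname{msq}}(\omega_i)$ holds, then for each $b'\in A$, $b\to\widehat{\operatorname{msq}}_n(b',b)(\omega)$ is strictly convex, continuous, and $\lim_{b\uparrow A}\widehat{\operatorname{msq}}_n(b',b)(\omega) = \infty$. 

It holds 

$$\widehat{\operatorname{msq2}}_n(b',b) = \frac{1}{n^2}\sum_{i=1}^nZ_i^2L_i'L_i(b)\sum_{j\in\{1,\ldots,n\},\,j\neq i}\frac{L_j'}{L_j(b)} + \frac1n\overline{((ZL')^2)}_n. \tag{6.53}$$

Thus, $b\to\widehat{\operatorname{var}}_n(b',b)$ and $b\to\widehat{\operatorname{msq2}}_n(b',b)$ are positively linearly equivalent to $b\to f_{\operatorname{var},n}(b',b)$ for 

$$f_{\operatorname{var},n}(b',b) = \sum_{i=1}^nZ_i^2L_i'\sum_{j\in\{1,\ldots,n\},\,j\neq i}\frac{L_j'L_i(b)}{L_j(b)}, \quad b',b\in A. \tag{6.54}$$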
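The decomposition (6.53) can be verified numerically. The sketch below assumes that the new mean square estimator is the product of sample means $\overline{(Z^2L'L(b))}_n\,\overline{(L'/L(b))}_n$; this reading of Section 4.2 is an assumption of the example, as are the function names and the plain-list data layout.

```python
def msq2(z, lp, lb):
    """Product of the sample means of Z^2 L' L(b) and of L'/L(b) (assumed form)."""
    n = len(z)
    first = sum(zi * zi * lpi * lbi for zi, lpi, lbi in zip(z, lp, lb)) / n
    second = sum(lpi / lbi for lpi, lbi in zip(lp, lb)) / n
    return first * second


def f_var(z, lp, lb):
    """Off-diagonal double sum from (6.54)."""
    n = len(z)
    return sum(
        z[i] ** 2 * lp[i] * lb[i]
        * sum(lp[j] / lb[j] for j in range(n) if j != i)
        for i in range(n)
    )


def msq2_via_f_var(z, lp, lb):
    """Right-hand side of (6.53): f_var / n^2 plus the diagonal term."""
    n = len(z)
    diagonal = sum((zi * lpi) ** 2 for zi, lpi in zip(z, lp)) / n  # mean((Z L')^2)
    return f_var(z, lp, lb) / n ** 2 + diagonal / n


z = [1.0, 2.0, 0.5]    # samples of Z
lp = [1.0, 0.5, 2.0]   # samples of L' = L(b')
lb = [2.0, 1.0, 0.25]  # samples of L(b)
gap = abs(msq2(z, lp, lb) - msq2_via_f_var(z, lp, lb))
```

Since the diagonal term does not depend on $b$, minimizing either side over $b$ is equivalent to minimizing `f_var`.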


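Condition 104. For each $\omega_1,\omega_2\in\Omega_1$, $b\to\frac{L(b)(\omega_1)}{L(b)(\omega_2)}$ is convex. 

Note that if Condition 104 holds, then $b\to f_{\operatorname{var},n}(b',b)$ is convex and thus so are $b\to\widehat{\operatorname{var}}_n(b',b)$ and $b\to\widehat{\operatorname{msq2}}_n(b',b)$. 

Remark 105. Let us assume Condition 32 and that $b\in A\to L(b)(\omega)\in\mathbb{R}$ is continuous, $\omega\in\Omega_1$. Then, for each $n\in\mathbb{N}_+$, from $\widehat{\operatorname{msq}}_n$ being nonnegative, Condition 94 holds for $h(b,\cdot) = \widehat{\operatorname{msq}}_n(b',b)(\kappa_n)$, for which $f = \operatorname{msq}$ in that condition. 

Lemma 106. Under Condition 101, if $\mathbb{Q}_1(p_{\operatorname{msq}}) > 0$ and for some $b\in A$, $\operatorname{msq}(b) < \infty$, then $f = \operatorname{msq}$ satisfies Condition 97. 

Proof. It follows from Remark 103, Remark 105 for $n = 1$, and Lemma 100. □ 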


6.8 Mean square and its estimators in the ECM setting 

Let us consider the ECM setting as in Section 5.1 for $A$ open. As discussed there, for each $\omega\in\Omega_1$, $b\to L(b)(\omega)$ is convex, and under Condition 35, $b\to L(b)(\omega)$ is strictly convex. Thus, under Condition 35, for each $\omega\in\Omega_1$, $p_{\operatorname{msq}}(\omega)$ holds (see Definition 102) only if $Z(\omega)\neq0$ and $\lim_{b\uparrow A}L(b)(\omega) = \infty$. Note that for $X$ having a non-degenerate normal distribution under $\mathbb{Q}_1$, for each $\omega\in\Omega_1$, $\lim_{|b|\to\infty}L(b)(\omega) = \infty$, so that $p_{\operatorname{msq}}$ holds only if $Z\neq0$. For $X$ having the distribution of a product of $n$ exponentially tilted distributions from the gamma family under $\mathbb{Q}_1$, we have $\mathbb{Q}_1(X\in\mathbb{R}_+^n) = 1$, and for $\omega\in\Omega_1$ such that $X(\omega)\in\mathbb{R}_+^n$, we have $L(b)(\omega)\to\infty$ as $b\uparrow A$. Thus, for such an $\omega$, $p_{\operatorname{msq}}(\omega)$ holds only if $Z(\omega)\neq0$, and the condition $\mathbb{Q}_1(p_{\operatorname{msq}}) > 0$, appearing in Lemma 106, reduces to $\mathbb{Q}_1(Z\neq0) > 0$. For $X$ having a Poisson distribution under $\mathbb{Q}_1$, we have $\lim_{|b|\to\infty}L(b)(\omega) = \infty$ when $X(\omega)\in\mathbb{N}_+$, but not when $X(\omega) = 0$. Thus, in such a case $p_{\operatorname{msq}}$ holds when $Z\neq0$ and $X\in\mathbb{N}_+$, but not when $X = 0$, and we have $\mathbb{Q}_1(p_{\operatorname{msq}}) > 0$ only if $\mathbb{Q}_1(X\in\mathbb{N}_+,\ Z\neq0) > 0$. 

Remark 107. Let us assume Condition 36. Then, for each $n\in\mathbb{N}_+$ and $b'\in A$, 

$$\nabla_b\widehat{\operatorname{msq}}_n(b',b) = \overline{(Z^2L'(\mu(b)-X)L(b))}_n \tag{6.55}$$

and 

$$\nabla_b^2\widehat{\operatorname{msq}}_n(b',b) = \overline{(Z^2L'(\Sigma(b) + (\mu(b)-X)(\mu(b)-X)^T)L(b))}_n. \tag{6.56}$$

Let us further in this remark assume Condition 35, so that $\Sigma(b)$ is positive definite. Then, for $\omega\in\Omega_1^n$ such that $Z(\omega_i)\neq0$ for some $i\in\{1,\ldots,n\}$, $\nabla_b^2\widehat{\operatorname{msq}}_n(b',b)(\omega)$ is positive definite for each $b', b\in A$. Indeed, in such a case for each $v\in\mathbb{R}^l\setminus\{0\}$ we have 

$$\begin{aligned}
v^T\nabla_b^2\widehat{\operatorname{msq}}_n(b',b)v &= v^T\Sigma(b)v\,\overline{(Z^2L'L(b))}_n + \overline{(Z^2L'L(b)((\mu(b)-X)^Tv)^2)}_n\\
&\ge v^T\Sigma(b)v\,\overline{(Z^2L'L(b))}_n > 0.
\end{aligned} \tag{6.57}$$

Note that Condition 104 holds for ECM since for each $\omega_1,\omega_2\in\Omega_1$ and $b\in A$ we have 

$$\frac{L(b)(\omega_1)}{L(b)(\omega_2)} = \exp(b^T(X(\omega_2)-X(\omega_1))). \tag{6.58}$$

In particular, as discussed in the previous section, the estimators of variance and the new estimators of mean square are convex. For each $n\in\mathbb{N}_+$, $\omega\in\Omega_1^n$, and $i,j\in\{1,\ldots,n\}$, let us denote 

$$v_{j,i}(\omega) = X(\omega_j) - X(\omega_i). \tag{6.59}$$

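For each $n\in\mathbb{N}_2$, let a function $g_{\operatorname{var},n} : A\times\mathbb{R}^l\times\Omega_1^n\to\mathbb{R}$ be such that for each $b'\in A$, $b\in\mathbb{R}^l$, and $\omega\in\Omega_1^n$ 

$$g_{\operatorname{var},n}(b',b)(\omega) = \sum_{i=1}^n(Z^2L')(\omega_i)\sum_{j\in\{1,\ldots,n\},\,j\neq i}L'(\omega_j)\exp(b^Tv_{j,i}(\omega)). \tag{6.60}$$

Note that for each $b'$ and $\omega$ as above, $b\in\mathbb{R}^l\to g_{\operatorname{var},n}(b',b)(\omega)$ is convex and 

$$g_{\operatorname{var},n}(b',b)(\omega) = f_{\operatorname{var},n}(b',b)(\omega), \quad b\in A. \tag{6.61}$$

For $A = \mathbb{R}^l$ we have $g_{\operatorname{var},n} = f_{\operatorname{var},n}$, but in some cases, like for the gamma family of distributions as in Section 5.1, we have $A\neq\mathbb{R}^l$ and $f_{\operatorname{var},n}$ is only a restriction of $g_{\operatorname{var},n}$. For each $b'$, $b$, and $\omega$ as above, it holds 

$$\nabla_bg_{\operatorname{var},n}(b',b)(\omega) = \sum_{i=1}^n(Z^2L')(\omega_i)\sum_{j\in\{1,\ldots,n\},\,j\neq i}L'(\omega_j)v_{j,i}(\omega)\exp(b^Tv_{j,i}(\omega)) \tag{6.62}$$

and 

$$\nabla_b^2g_{\operatorname{var},n}(b',b)(\omega) = \sum_{i=1}^n(Z^2L')(\omega_i)\sum_{j\in\{1,\ldots,n\},\,j\neq i}L'(\omega_j)v_{j,i}(\omega)v_{j,i}(\omega)^T\exp(b^Tv_{j,i}(\omega)). \tag{6.63}$$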
(6.63) 


Let $n\in\mathbb{N}_2$ and $\omega\in\Omega_1^n$. Let $D(\omega)\in\mathbb{R}^{l\times n^2}$ be a matrix whose $((i-1)n+j)$th column, $i,j\in\{1,\ldots,n\}$, is equal to $\mathbb{1}(Z\neq0)(\omega_i)v_{j,i}(\omega)$. 

Lemma 108. If $D(\omega)$ has linearly independent rows, then for each $b\in\mathbb{R}^l$ and $b'\in A$, $\nabla_b^2g_{\operatorname{var},n}(b',b)(\omega)$ is positive definite. 

Proof. If $D(\omega)$ has linearly independent rows, then for each $t\in\mathbb{R}^l$, $t\neq0$, there exist $i,j\in\{1,\ldots,n\}$, $i\neq j$, such that $t^T\mathbb{1}(Z\neq0)(\omega_i)v_{j,i}(\omega)\neq0$, so that from (6.63), 

$$t^T\nabla_b^2g_{\operatorname{var},n}(b',b)(\omega)t \ge (Z^2L')(\omega_i)L'(\omega_j)(t^Tv_{j,i}(\omega))^2\exp(b^Tv_{j,i}(\omega)) > 0. \tag{6.64}$$

□ 

Let for each $k\in\{1,\ldots,n\}$ a matrix $D(k,\omega)\in\mathbb{R}^{l\times(n-1)}$ have the consecutive columns equal to $v_{k,j}(\omega)$ for $j = 1,2,\ldots,k-1,k+1,k+2,\ldots,n$. 

Lemma 109. $D(\omega)$ has linearly independent rows only if for some $i\in\{1,\ldots,n\}$, $Z(\omega_i)\neq0$, and for some (equivalently, for each) $k\in\{1,\ldots,n\}$, $D(k,\omega)$ has linearly independent rows. 

Proof. If $Z(\omega_i) = 0$, $i = 1,\ldots,n$, then $D(\omega)$ has zero rows, so that they are linearly dependent. The dimensions of the linear spans of the columns and of the rows of a matrix are the same, so that the matrices $D(\omega)$ and $D(k,\omega)$, $k\in\{1,\ldots,n\}$, have linearly independent rows only if the dimension of the linear span of their columns is equal to $l$. Thus, the thesis follows from the easy to check fact that the linear span $V$ of the columns of $D(k,\omega)$ for different $k\in\{1,\ldots,n\}$ is the same and if $Z(\omega_i)\neq0$ for some $i\in\{1,\ldots,n\}$, then the linear span of the columns of $D(\omega)$ is equal to $V$. □ 

For a vector $v\in\mathbb{R}^m$, by $v\le0$ we mean that its coordinates are nonpositive. 

Theorem 110. If the system of linear inequalities 

$$D^T(\omega)b\le0, \quad b\in\mathbb{R}^l, \tag{6.65}$$

has only the zero solution, then for each $b'\in A$, $g_{\operatorname{var},n}(b',b)(\omega)\to\infty$ as $|b|\to\infty$ and $\nabla_b^2g_{\operatorname{var},n}(b',b)(\omega)$ is positive definite, $b\in\mathbb{R}^l$. If $b$ is a solution of (6.65), then for each $b'\in A$, $a\in\mathbb{R}^l$, and $t\in[0,\infty)$, we have 

$$g_{\operatorname{var},n}(b',a+tb)(\omega) \le g_{\operatorname{var},n}(b',a)(\omega). \tag{6.66}$$

Proof. For $b\in\mathbb{R}^l$ for which (6.65) holds, for $i\in\{1,\ldots,n\}$ such that $Z(\omega_i)\neq0$, for $j\in\{1,\ldots,n\}$, $i\neq j$, we have $b^Tv_{j,i}(\omega)\le0$, so that (6.66) follows from (6.60). Let further (6.65) have only the zero solution. Then, $D(\omega)$ has linearly independent rows, and thus the positive definiteness of $\nabla_b^2g_{\operatorname{var},n}(b',b)(\omega)$ follows from Lemma 108. Consider a function $b\in\mathbb{R}^l\to f(b) := \max\{b^Tv_{j,i}(\omega) : Z(\omega_i)\neq0,\ i,j\in\{1,\ldots,n\},\ i\neq j\}$. Then, for each $b\in\mathbb{R}^l$, $b\neq0$, it holds $f(b) > 0$. Thus, from the continuity of $f$ we have $0 < \delta := \min\{f(b) : |b| = 1\}$ and for $0 < a := \min\{(Z^2L')(\omega_i)L'(\omega_j) : Z(\omega_i)\neq0,\ i,j\in\{1,\ldots,n\},\ i\neq j\}$, from (6.60) 

$$g_{\operatorname{var},n}(b',b)(\omega) \ge a\exp(\delta|b|)\to\infty \tag{6.67}$$

as $|b|\to\infty$. □ 

There exist numerical methods for finding the set of solutions of (6.65) and in particular for checking if it has only the zero solution; see [33]. 

Theorem 111. Let us assume that 

$$Z(\omega_i)\neq0, \quad i = 1,\ldots,n. \tag{6.68}$$

Then, (6.65) has only the zero solution only if $D(\omega)$ has linearly independent rows, which from Lemma 109 holds only if for some (equivalently, for each) $k\in\{1,\ldots,n\}$, $D(k,\omega)$ has linearly independent rows. 

Proof. Assuming (6.68), for $b\in\mathbb{R}^l$, $D^T(\omega)b\le0$ holds only if 

$$v_{i,j}(\omega)^Tb\le0, \quad i,j\in\{1,\ldots,n\}. \tag{6.69}$$

Since $v_{i,j}(\omega) = -v_{j,i}(\omega)$, this holds only if $v_{i,j}(\omega)^Tb = 0$, $i,j\in\{1,\ldots,n\}$, i.e. only if $D^T(\omega)b = 0$. □ 
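In the one-dimensional case $l = 1$ the system (6.65) can be checked by inspection: it has only the zero solution exactly when the nonzero columns of $D(\omega)$ take both signs. The sketch below illustrates this and the resulting behaviour of $g_{\operatorname{var},n}$ from (6.60); it is restricted to $l = 1$, takes all weights $L'$ equal to one by default, and the function names are hypothetical.

```python
import math


def only_zero_solution_1d(z, x):
    """Check whether D^T b <= 0 has only the zero solution when l = 1.

    The columns of D are 1(Z_i != 0) * (X_j - X_i); for l = 1 the system
    forces b = 0 exactly when some column is positive and some is negative.
    """
    n = len(x)
    cols = [x[j] - x[i] for i in range(n) if z[i] != 0
            for j in range(n) if j != i]
    return any(c > 0 for c in cols) and any(c < 0 for c in cols)


def g_var_1d(b, z, x, w=None):
    """g_var,n from (6.60) for l = 1, with w standing for the weights L'."""
    n = len(x)
    if w is None:
        w = [1.0] * n
    return sum(
        z[i] ** 2 * w[i]
        * sum(w[j] * math.exp(b * (x[j] - x[i])) for j in range(n) if j != i)
        for i in range(n)
    )


x = [0.0, 1.0, 2.0]
z_edge = [1.0, 0.0, 0.0]    # only the smallest X has Z != 0
z_inner = [0.0, 1.0, 0.0]   # only the middle X has Z != 0
```

With `z_edge` the value of `g_var_1d` decreases along $b\to-\infty$, in line with (6.66), while with `z_inner` it grows in both directions, in line with (6.67). For $l > 1$ a linear programming routine would be needed instead of the sign test.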


6.9 Strongly convex functions and ε-minimizers 

For some nonempty $A\subset\mathbb{R}^l$ consider a function $f : A\to\mathbb{R}$. For some $\epsilon\ge0$, we say that $x^*\in A$ is an $\epsilon$-minimizer of $f$, if 

$$f(x^*) \le \inf_{x\in A}f(x) + \epsilon. \tag{6.70}$$

Consider a convex set $S\subset A$, such that $A$ is a neighbourhood of $S$ (i.e. $S$ is contained in some open set $D\subset A$). Then, $f$ is said to be strongly convex on $S$ (where we do not mention $S$ if it is equal to $A$) with a (strong convexity) constant $m > 0$, if $f$ is twice differentiable on $S$ and for each $b\in\mathbb{R}^l$ and $x\in S$ 

$$b^T\nabla^2f(x)b \ge m|b|^2. \tag{6.71}$$

Let us discuss some properties of strongly convex functions $f$ on $S$ as above (see Section 9.1.2 in [9] for more details). It is well known that $f$ as above is strictly convex on $S$, and from Taylor's theorem it easily follows that for $x, y\in S$ 

$$f(y) \ge f(x) + (\nabla f(x))^T(y-x) + \frac{m}{2}|y-x|^2. \tag{6.72}$$

In particular, $f(y)\to\infty$ as $|y|\to\infty$, $y\in S$. Furthermore, if $\nabla f(x) = 0$, then 

$$f(y) \ge f(x) + \frac{m}{2}|y-x|^2. \tag{6.73}$$

Thus, $x$ is a unique minimum point of $f|_S$ only if $\nabla f(x) = 0$. The right-hand side of (6.72) as a function of $y\in\mathbb{R}^l$ is minimized by $\tilde y = x - \frac1m\nabla f(x)$, and thus we have 

$$f(y) \ge f(x) + (\nabla f(x))^T(\tilde y-x) + \frac{m}{2}|\tilde y-x|^2 = f(x) - \frac{1}{2m}|\nabla f(x)|^2. \tag{6.74}$$

Let $f$ have a unique minimum point $b^*\in S$. Then, from (6.74) for $y = b^*$, for $x\in S$ we have 

$$f(x) \le f(b^*) + \frac{1}{2m}|\nabla f(x)|^2. \tag{6.75}$$

In particular, each $x\in S$ is a $\frac{1}{2m}|\nabla f(x)|^2$-minimizer of $f$. 
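The bound (6.75) gives a computable stopping criterion: the squared gradient norm divided by $2m$ bounds the suboptimality. A minimal sketch for a diagonal quadratic $f(x) = \sum_i(\frac12q_ix_i^2 + c_ix_i)$ with all $q_i > 0$, for which $m = \min_iq_i$ is a strong convexity constant, could look as follows; the quadratic and the function names are chosen only for illustration.

```python
def f(x, q, c):
    """Diagonal quadratic: sum of q_i x_i^2 / 2 + c_i x_i."""
    return sum(0.5 * qi * xi * xi + ci * xi for xi, qi, ci in zip(x, q, c))


def grad(x, q, c):
    """Gradient of the diagonal quadratic."""
    return [qi * xi + ci for xi, qi, ci in zip(x, q, c)]


def suboptimality_bound(x, q, c):
    """Right-hand side excess in (6.75): |grad f(x)|^2 / (2 m), m = min q_i."""
    m = min(q)
    return sum(g * g for g in grad(x, q, c)) / (2.0 * m)


q = [1.0, 4.0]
c = [-1.0, 0.0]
b_star = [-ci / qi for qi, ci in zip(q, c)]   # unique minimum point
x = [0.0, 1.0]
gap = f(x, q, c) - f(b_star, q, c)            # true suboptimality of x
bound = suboptimality_bound(x, q, c)          # epsilon for which x is an epsilon-minimizer
```

The bound is conservative when the Hessian eigenvalues are spread out, since only the smallest one enters through $m$.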

6.10 Mean square and its well-known estimators in the LETGS setting 

Let us in this section consider the LETGS setting. 

Theorem 112. Let $b'\in\mathbb{R}^l$, $n\in\mathbb{N}_+$, and $\omega\in\Omega_1^n$. Then, $b\in\mathbb{R}^l\to f(b) := \widehat{\operatorname{msq}}_n(b',b)(\omega)$ is convex and if $r_n(\omega)$ holds (see Definition 74), then $f$ is strongly convex. If $r_n(\omega)$ does not hold, then there exists $b\in\mathbb{R}^l\setminus\{0\}$ such that 

$$f(a+tb) = f(a), \quad a\in\mathbb{R}^l,\ t\in\mathbb{R}. \tag{6.76}$$

Proof. It holds 

$$\nabla^2f(b) = \overline{(Z^2L'(2G + (2Gb+H)(2Gb+H)^T)L(b))}_n(\omega), \tag{6.77}$$

which is positive semidefinite, so that $f$ is convex. If $r_n(\omega)$ does not hold, then from the first point of Lemma 73 there exists $b\in\mathbb{R}^l$, $b\neq0$, such that for each $i\in\{1,\ldots,n\}$, $A_b(\omega_i)$ does not hold, so that from Lemma 72 and (4.21) we obtain (6.76). Let us assume that $r_n(\omega)$ holds. Then, from the second point of Lemma 73 for $K = Z^2L'$, the matrix $M := \overline{(Z^2L'G)}_n(\omega)$ is positive definite. Let $m_1 > 0$ be such that $b^TMb \ge m_1|b|^2$, $b\in\mathbb{R}^l$. For each $i\in\{1,\ldots,n\}$ such that $\tau(\omega_i) < \infty$, from Remark 59 we have 

$$m_{2,i} := \inf_{b\in\mathbb{R}^l}L(b)(\omega_i) = \exp\Bigl(\inf_{b\in\mathbb{R}^l}\ln(L(b)(\omega_i))\Bigr)\in\mathbb{R}_+. \tag{6.78}$$

Let $m_2 = \min\{m_{2,i} : i\in\{1,\ldots,n\},\ \tau(\omega_i) < \infty\}$. Then, $m_2\in\mathbb{R}_+$ and for each $a,b\in\mathbb{R}^l$ we have 

$$a^T\nabla^2f(b)a \ge 2a^T\overline{(Z^2L'GL(b))}_n(\omega)a \ge 2m_2a^T\overline{(Z^2L'G)}_n(\omega)a \ge 2m_1m_2|a|^2. \tag{6.79}$$

□ 


Theorem 113. If conditions 32 and 76 hold, then a.s. for a sufficiently large $n$, $b\to\widehat{\operatorname{msq}}_n(b',b)(\kappa_n)$ is strongly convex. In particular, the probability of this event converges to one as $n\to\infty$. 

Proof. It follows directly from theorems 82 and 112. □ 

Theorem 114. Let Condition 52 hold. If $\operatorname{msq}(b_0) < \infty$ for some $b_0\in\mathbb{R}^l$ then $\operatorname{msq}$ is convex on the convex nonempty set $B$ on which it is finite and if further Condition 76 holds, then $f = \operatorname{msq}$ satisfies Condition 97 (in particular, from Lemma 98, it has a unique minimum point). If Condition 76 does not hold, then there exists $b\in\mathbb{R}^l$, $b\neq0$, such that 

$$\operatorname{msq}(a+tb) = \operatorname{msq}(a), \quad a\in\mathbb{R}^l,\ t\in\mathbb{R}. \tag{6.80}$$

Proof. The first part of the thesis follows from theorems 112 and 113, the properties of strongly convex functions discussed in Section 6.9, Remark 105, and, under Condition 32, from Lemma 100 for $h(b,\cdot) = \widehat{\operatorname{msq}}_n(b',b)(\kappa_n)$ for a sufficiently large $n$. If Condition 76 does not hold, then there exists $b\in\mathbb{R}^l$, $b\neq0$, such that $\mathbb{Q}_1(A_b) = 0$, for which (6.80) follows from Lemma 72 and formula (4.9). □ 

6.11 Smoothness of functions in the LETS setting 

In this section we provide some sufficient conditions for the smoothness and for certain 
properties of the derivatives of functions defined in Section 4.1. Unless stated otherwise, we 
consider the LETS setting, which contains the LETGS setting as a special case (see Section 
5.3.4). From Remark 61, the ECM setting for $A = \mathbb{R}^l$ can be identified with the LETS setting for $\tau = 1$ and $\Lambda_0 = I_l$, so that it is easy to modify the theory below to deal also with such an ECM setting.

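Condition 115. A measurable function $S : \mathcal{S}_1 \to \mathcal{S}(\mathbb{R})$ is such that conditions 68 and 69 hold and for each $\theta \in (\mathbb{R}^d)^s$

$$\mathbb{E}_{\mathbb{U}}\Big(|S| \exp\Big(\sum_{i=1}^{s} \theta_i^T X(\eta_i)\Big)\Big) < \infty. \tag{6.81}$$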

Note that Condition 115 implies that S is U-integrable. 

Remark 116. In the special case which can be identified with the ECM setting for $A = \mathbb{R}^l$ as discussed above, Condition 68 holds for $R = 1$ and Condition 69 holds for $s = 1$. Thus, for some $S : \mathcal{S}_1 \to \mathcal{S}(\mathbb{R})$, a counterpart of Condition 115 in the ECM setting for $A = \mathbb{R}^l$ reduces to demanding that

$$\mathbb{E}_{\mathbb{U}}(|S| \exp(\theta^T X)) < \infty, \quad \theta \in \mathbb{R}^l. \tag{6.82}$$

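Remark 117. Since for each $s \in \mathbb{N}_+$ and $\theta \in (\mathbb{R}^d)^s$, $\mathbb{E}_{\mathbb{U}}(\exp(\sum_{i=1}^{s} \theta_i^T X(\eta_i))) = \exp(\sum_{i=1}^{s} \Psi(\theta_i)) < \infty$, from Hölder's inequality, (6.81) holds if we have $\mathbb{E}_{\mathbb{U}}(|S|^q) < \infty$ for some $q \in (1, \infty)$.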

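Condition 118. We have $t, s \in \mathbb{N}_+$ and $f : \mathcal{S}((\mathbb{R}^l)^t) \otimes \mathcal{S}_1 \to \mathcal{S}(\mathbb{R})$ is such that for each $M \in \mathbb{R}_+$, for some $N \in \mathbb{N}_+$, $\varphi \in ((\mathbb{R}^d)^s)^N$, and $u \in \mathbb{R}_+^N$, we have $\mathbb{U}$ a.s.

$$\sup_{b \in (\mathbb{R}^l)^t :\ |b_i| \le M,\ i = 1, \ldots, t} |f(b)| \le \sum_{i=1}^{N} u_i \exp\Big(\sum_{j=1}^{s} \varphi_{i,j}^T X(\eta_j)\Big) \tag{6.83}$$

(where $f(b) = f(b, \cdot)$).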
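Remark 119. Let Condition 118 hold and $S$ satisfy Condition 115 (for the same $s$). Let $M \in \mathbb{R}_+$ and consider the corresponding $N$, $\varphi$, and $u$ as in Condition 118. Then, for each $\theta \in (\mathbb{R}^d)^s$

$$\mathbb{E}_{\mathbb{U}}\Big(\sup_{b \in (\mathbb{R}^l)^t :\ |b_i| \le M} \Big|S f(b) \exp\Big(\sum_{j=1}^{s} \theta_j^T X(\eta_j)\Big)\Big|\Big) \le \sum_{i=1}^{N} u_i \mathbb{E}_{\mathbb{U}}\Big(|S| \exp\Big(\sum_{j=1}^{s} (\varphi_{i,j} + \theta_j)^T X(\eta_j)\Big)\Big) < \infty. \tag{6.84}$$

In particular, $\mathbb{E}_{\mathbb{U}}(\sup_{b \in (\mathbb{R}^l)^t :\ |b_i| \le M} |S f(b)|) < \infty$. Furthermore, from the above $M \in \mathbb{R}_+$ being arbitrary, for each $b \in (\mathbb{R}^l)^t$, $S f(b)$ satisfies Condition 115.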

In this work we assume that $x^0 = 1$, $x \in \mathbb{R}$.

Theorem 120. Let conditions 68 and 69 hold, let t e N+, r e R^ lu e u e z e 

V e cind q e 11^=1 Letforeachbe (R')^ 


t l TAS d 

fib] =i(s ^ 0)1 n iLibmV”' n Ky n (fi 

m=l i=l ’ i=l j=l 

I ym,i 

• n (A.-i)" 7 "'^) n (5.„,-,'i'(A;-i(h„)))^-'V))i. 

fc=l 7 = 1 


(6.85) 


Then, Condition 118 holds for such an f (for the same t and s as above). 

Proof. Let M e [0,oo) and g(x) = e^ + e~^. For p e R and h e R^ |h| < M, from (5.42) we have U 

a.s. 

l(S7^0)LP(h)<.S:(p):=exp(|p|5F(RM))n (6-86) 

;=i;=i 


Let for X e [0, oo) and aeN^ 

Uaix) = l+ sup \da^ib)\, (6.87) 

beR( \b\<x 

which is finite thanks to Remark 37. Then, for each b e (R^) ^ | fi, | ^ M, i = 1,..., m, we have U 

a.s. 


\fib)\< 1(S 7^ 0) n iKirm)M^‘-^ f\iY[igiXjiqi)fy'-'‘’^ (1 + R)^i=i^-.‘-.M) 

m=l i=l j=l 

, ( 6 . 88 ) 

ym, i 

• n (RiW)^-. 0 ). 

y=i 

The right-hand side of (6.88) can be rewritten in the same form as the right-hand side of
(6.83). □

Theorem 121. If conditions 68 and 69 hold, then for each $t \in \mathbb{N}_+$, $p_1, p_2 \in \mathbb{R}^t$ such that $p_{2,i} \ge 1$, $i = 1, \ldots, t$, $M \in \mathbb{R}_+$, $v \in (\mathbb{N}^l)^t$, and for $h_{p_1,p_2}(b, \omega) := (1(S \ne 0) \prod_{i=1}^{t} |\partial_{v_i}(L^{p_{1,i}}(b_i))|^{p_{2,i}})(\omega)$, $b \in (\mathbb{R}^l)^t$, $\omega \in \Omega_1$, Condition 118 holds for $f = h_{p_1,p_2}$.

Proof. Since for ps e {N+)such that psj > p 2 ,i, i = have 

t 

I(h)I < 1{S 7 ^ 0 ) n (1 + (^;)), (6.89) 

i=\ 

it is sufficient to prove the above theorem for p 2 ^ In such a case hp^^p^ib) is a linear 
combination of a finite number of variables as in (6.85). Thus, the thesis follows from Theorem 
120 . □ 

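Theorem 122. If Condition 115 holds, then for each $p \in \mathbb{R}$, for $g(b) = S L^p(b)$, $b \in \mathbb{R}^l$, $b \to f(b) = \mathbb{E}_{\mathbb{U}}(g(b)) \in \mathbb{R}$ is smooth and we have $\partial_v f(b) = \mathbb{E}_{\mathbb{U}}(\partial_v g(b))$, $v \in \mathbb{N}^l$, $b \in \mathbb{R}^l$.

Proof. It follows from Theorem 121 for $p_1 = p$ and $p_2 = 1$ and from Remark 119 by induction over $|v|$ using the mean value and Lebesgue's dominated convergence theorems. □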

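Theorem 123. If Condition 115 holds

1. for $S = 1$, then $1 = \mathbb{E}_{\mathbb{U}}(L^{-1}(b))$ and for each $v \in \mathbb{N}^l$, $v \ne 0$, $\mathbb{E}_{\mathbb{U}}(\partial_v(L^{-1}(b))) = 0$, $b \in \mathbb{R}^l$,

2. for $S = Z^2$, then msq is smooth and $\partial_v \mathrm{msq}(b) = \mathbb{E}_{\mathbb{U}}(Z^2 \partial_v L(b))$, $b \in \mathbb{R}^l$, $v \in \mathbb{N}^l$,

3. for $S = C$, then $b \to c(b)$ is smooth and $\partial_v c(b) = \mathbb{E}_{\mathbb{U}}(C \partial_v(L^{-1}(b))) = \mathbb{E}_{\mathbb{U}}(1(C \ne \infty) C \partial_v(L^{-1}(b)))$, $b \in \mathbb{R}^l$, $v \in \mathbb{N}^l$,

4. for $S$ equal to $Z^2$ and $C$, then ic is smooth.

Proof. The first three points follow from Theorem 122 and from the fact that due to remarks 70 and 54, we have $1 = \mathbb{E}_{\mathbb{U}}(L(b)^{-1})$, $\mathrm{msq}(b) = \mathbb{E}_{\mathbb{U}}(Z^2 L(b))$, and $c(b) = \mathbb{E}_{\mathbb{U}}(C L^{-1}(b))$ respectively, $b \in \mathbb{R}^l$, and in the third point additionally (4.27). The last point is a consequence of points two and three. □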

Theorem 124. In the ECM setting for $A = \mathbb{R}^l$ let us assume that $\mathbb{Q}_1(Z \ne 0) > 0$, conditions 35 and 36 hold, and we have (6.82) for $S = Z^2$. Then, $\nabla^2 \mathrm{msq}(b)$ exists and is positive definite, $b \in \mathbb{R}^l$.

Proof. From a counterpart of Theorem 120 and Remark 119 for ECM, $W = Z^2 L(b)(\nabla\Psi(b) - X)(\nabla\Psi(b) - X)^T$ has integrable entries. Thus, from the second point of a counterpart of Theorem 123 for ECM

$$\nabla^2 \mathrm{msq}(b) = \mathbb{E}_{\mathbb{Q}_1}(Z^2 L(b)(\nabla^2\Psi(b) + (\nabla\Psi(b) - X)(\nabla\Psi(b) - X)^T)) = \nabla^2\Psi(b)\,\mathrm{msq}(b) + \mathbb{E}_{\mathbb{Q}_1}(W). \tag{6.90}$$

For $v \in \mathbb{R}^l$, $v^T \mathbb{E}_{\mathbb{Q}_1}(W) v = \mathbb{E}_{\mathbb{Q}_1}(Z^2 L(b)((\nabla\Psi(b) - X)^T v)^2)$, so that $\mathbb{E}_{\mathbb{Q}_1}(W)$ is positive semidefinite. Thus, the thesis follows from the fact that as discussed in Section 5.1, $\nabla^2\Psi(b)$ is positive definite and from $\mathrm{msq}(b) \in \mathbb{R}_+$, $b \in \mathbb{R}^l$. □
Theorem 125. In the LETGS setting, if Condition 115 holds for $S = Z^2$ and Condition 76 holds, then for a positive definite matrix

$$M = \mathbb{E}_{\mathbb{Q}_1}\Big(2 G Z^2 \exp\Big(-\frac{1}{2} \sum_{i=1}^{\tau} |\eta_i|^2\Big)\Big), \tag{6.91}$$

$\nabla^2 \mathrm{msq}(b) - M$ is positive semidefinite, $b \in \mathbb{R}^l$. In particular, msq is strongly convex with a constant $m$ equal to the lowest eigenvalue of $M$.

Proof. From the second point of Theorem 123, we have

$$\nabla^2 \mathrm{msq}(b) = \mathbb{E}_{\mathbb{Q}_1}(Z^2 (2G + (2Gb + H)(2Gb + H)^T) L(b)). \tag{6.92}$$

From Theorem 120 and Remark 119, $Z^2 G$ and $W := Z^2 (2Gb + H)(2Gb + H)^T L(b)$ have $\mathbb{Q}_1$-integrable entries, and from $1(Z \ne 0)|\exp(-\frac{1}{2} \sum_{i=1}^{\tau} |\eta_i|^2)| \le 1$, so does $Z^2 G \exp(-\frac{1}{2} \sum_{i=1}^{\tau} |\eta_i|^2)$. Furthermore, $v^T \mathbb{E}_{\mathbb{Q}_1}(W) v = \mathbb{E}_{\mathbb{Q}_1}(Z^2 L(b)((2Gb + H)^T v)^2)$, $v \in \mathbb{R}^l$, i.e. $\mathbb{E}_{\mathbb{Q}_1}(W)$ is positive semidefinite. From Lemma 79 for $K = 2 Z^2 \exp(-\frac{1}{2} \sum_{i=1}^{\tau} |\eta_i|^2)$, $M$ is positive definite. Furthermore, from Remark 59, for each $v \in \mathbb{R}^l$, $v^T \mathbb{E}_{\mathbb{Q}_1}(2 G Z^2 L(b)) v \ge v^T M v$, and thus also $v^T (\mathbb{E}_{\mathbb{Q}_1}(2 G Z^2 L(b)) - M) v \ge 0$. □

6.12 Some properties of inefficiency constants 

Let us consider the inefficiency constant function and its estimator as in sections 4.1 and 4.2. 

Condition 126. It holds $\inf_{b \in A} c(b) = c_{\min} \in \mathbb{R}_+$.

Condition 127. For some $C_{\min} \in \mathbb{R}_+$ we have $C(\omega) \ge C_{\min}$, $\omega \in \Omega_1$.

Note that Condition 127 implies Condition 126 for Cmin = Cmin- 

Remark 128. Note that in the Euler scheme case as in Section 5.5, for $\tau$ being the exit time of the scheme of a set $D$ such that $x_0 \in D$, for $s \in \mathbb{R}_+$ and $C = s\tau$, Condition 127 holds for $C_{\min} = s$.

Under Condition 126 we have $\mathrm{ic} \ge c_{\min} \mathrm{var}$ and thus if further $A$ is open and $\lim_{b \uparrow A} \mathrm{var}(b) = \infty$, then $\lim_{b \uparrow A} \mathrm{ic}(b) = \infty$. Note also that if $c$ and var are lower semicontinuous (which from Lemma 95 holds e.g. if $b \to L(b)(\omega)$ is continuous, $\omega \in \Omega_1$) then ic is lower semicontinuous as well. Thus, if further $A$ is open and $\lim_{b \uparrow A} \mathrm{ic}(b) = \infty$, then from Lemma 98, ic attains a minimum on $A$.

Remark 129. Let us assume that var has a unique minimum point $b^* \in A$. If for some $b \in A$ it holds $\mathrm{ic}(b) < \mathrm{ic}(b^*)$, then $b \ne b^*$ and thus $\mathrm{var}(b^*) < \mathrm{var}(b)$, so that we must have $c(b) < c(b^*)$. Note that if var, $c$, and ic are differentiable (some sufficient assumptions for which were discussed in Section 6.11), then a sufficient condition for the existence of $b \in A$ such that $\mathrm{ic}(b) < \mathrm{ic}(b^*)$ is that $\nabla \mathrm{ic}(b^*) \ne 0$. Since $\nabla \mathrm{var}(b^*) = 0$, we have $\nabla \mathrm{ic}(b^*) = \mathrm{var}(b^*) \nabla c(b^*)$, so that $\nabla \mathrm{ic}(b^*) \ne 0$ only if $\mathrm{var}(b^*) \ne 0$ and $\nabla c(b^*) \ne 0$.

Remark 130. Let $c(b) > 0$, $b \in A$, let var have a unique minimum point $b^*$, and let $\mathrm{var}(b^*) = 0$ and $c(b^*) < \infty$. Then, $b^*$ is also the unique minimum point of ic and we have $\mathrm{ic}(b^*) = 0$. If further $A$ is open, msq and $c$ are twice continuously differentiable, and $\nabla^2 \mathrm{msq}(b^*)$ is positive definite, then from

$$\nabla^2 \mathrm{ic}(b) = (\nabla^2 c(b)) \mathrm{var}(b) + c(b) \nabla^2 \mathrm{msq}(b) + (\nabla c(b))(\nabla \mathrm{msq}(b))^T + (\nabla \mathrm{msq}(b))(\nabla c(b))^T, \tag{6.93}$$

we have

$$\nabla^2 \mathrm{ic}(b^*) = c(b^*) \nabla^2 \mathrm{msq}(b^*), \tag{6.94}$$

and thus $\nabla^2 \mathrm{ic}(b^*)$ is positive definite.
Minimization methods of estimators and their convergence properties


7.1 A simple adaptive IS procedure used in our numerical experiments

Let us describe a simple framework of adaptive IS via minimization of estimators of various
functions from Section 4.2, shown in Scheme 1. A special case of this framework was used in
the numerical experiments in this work. In the further sections we discuss some modifications
of this framework which ensure suitable convergence and asymptotic properties of the
minimization results of the estimators.

Consider some estimators $\widehat{\mathrm{est}}_k$, $k \in \mathbb{N}_p$, as in (4.13). Let $b_0 \in A$, $k \in \mathbb{N}_+$, $n_i \in \mathbb{N}_+$, $i = 1, \ldots, k$, and $N = n_{k+1} \in \mathbb{N}_+$. Let $b_i$, $i = 1, \ldots, k$, be some $A$-valued random variables, defined in Scheme 1. Let us assume Condition 19 and let for $i = 1, \ldots, k+1$ and $j = 1, \ldots, n_i$, $\beta_{i,j}$ be i.i.d. $\sim \mathbb{P}_1$ and $\chi_{i,j} = \xi(\beta_{i,j}, b_{i-1})$. Let us denote $\tilde{\chi}_i = (\chi_{i,j})_{j=1}^{n_i}$, $i = 1, \ldots, k+1$.

Scheme 1 A scheme of adaptive IS

for i := 1 to k do

Minimize $b \to \widehat{\mathrm{est}}_{n_i}(b_{i-1}, b)(\tilde{\chi}_i)$, e.g. using exact formulas or some numerical minimization method started at $b_{i-1}$. Let $b_i$ be the minimization result.

end for

Approximate $\alpha$ with

$$\overline{(Z L(b_k))}_N(\tilde{\chi}_{k+1}). \tag{7.1}$$

For $k = 1$ we call the inside of the loop in Scheme 1 single-stage minimization (SSM) and denote $b' = b_0$, while for $k > 1$ we call this whole loop multi-stage minimization (MSM).
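The loop of Scheme 1 can be illustrated by a minimal sketch under strong simplifying assumptions that are not taken from this work: the family is the one-dimensional Gaussian mean shift $\mathbb{Q}(b) = N(b, 1)$ with $\mathbb{Q}_1 = N(0, 1)$, the quantity of interest is $\alpha = \mathbb{Q}_1(X > c)$, and the minimized estimator is the cross-entropy one, whose minimizer in this toy family is the weighted sample mean. The function names `likelihood_ratio` and `adaptive_is` are hypothetical.

```python
import math
import random


def likelihood_ratio(b, x):
    # L(b)(x) = dQ_1/dQ(b)(x) for the assumed toy family Q_1 = N(0,1), Q(b) = N(b,1).
    return math.exp(-b * x + 0.5 * b * b)


def adaptive_is(c, b0=0.0, stages=3, n_stage=2000, n_final=20000, seed=0):
    """Multi-stage minimization followed by the final IS estimate, as in Scheme 1."""
    rng = random.Random(seed)
    b = b0
    for _ in range(stages):
        # Stage i: sample under Q(b_{i-1}) and weight by Z * L(b_{i-1}).
        xs = [rng.gauss(b, 1.0) for _ in range(n_stage)]
        w = [likelihood_ratio(b, x) if x > c else 0.0 for x in xs]
        total = sum(w)
        if total > 0.0:
            # Cross-entropy minimizer in this toy family: weighted sample mean.
            b = sum(wi * x for wi, x in zip(w, xs)) / total
        # If no sample hit the event, keep the previous parameter.
    # Final step: approximate alpha with the IS average under Q(b_k).
    xs = [rng.gauss(b, 1.0) for _ in range(n_final)]
    estimate = sum(likelihood_ratio(b, x) for x in xs if x > c) / n_final
    return estimate, b
```

Calling `adaptive_is(2.0)` returns the IS estimate together with the last parameter $b_k$; the estimate can be compared with the exact value `0.5 * math.erfc(2.0 / math.sqrt(2.0))`.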

Let us now consider the LETGS setting and $\xi$ as in (5.35). Then, using the notation (5.37), (7.1) is equal to

$$\frac{1}{N} \sum_{j=1}^{N} (Z L(b_k))^{(b_k)}(\beta_{k+1,j}). \tag{7.2}$$

From the discussion in Section 6.5, if $A_{n_i}(b_{i-1})(\tilde{\chi}_i) = \sum_{j=1}^{n_i} 2 (Z L(b_{i-1}) G)^{(b_{i-1})}(\beta_{i,j})$ is positive definite, then for $B_{n_i}(b_{i-1})(\tilde{\chi}_i) = -\sum_{j=1}^{n_i} (Z L(b_{i-1}) H)^{(b_{i-1})}(\beta_{i,j})$, the unique minimum point $b_i$ of $b \to \widehat{\mathrm{ce}}_{n_i}(b_{i-1}, b)(\tilde{\chi}_i)$ is given by the formula $b_i = (A_{n_i}(b_{i-1})(\tilde{\chi}_i))^{-1} B_{n_i}(b_{i-1})(\tilde{\chi}_i)$. Thus, in such a case, finding $b_i$ reduces to solving a linear system of equations. For $\widehat{\mathrm{est}}$ replaced by $\widehat{\mathrm{msq}}$, $\widehat{\mathrm{msq2}}$, or $\widehat{\mathrm{ic}}$, the functions $b \to \widehat{\mathrm{est}}_{n_i}(b_{i-1}, b)(\tilde{\chi}_i)$ can be minimized using some numerical minimization methods which can utilise some formulas for their exact derivatives. Let us only provide formulas for such derivatives used in our numerical experiments. It holds

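$$\nabla_b \widehat{\mathrm{msq}}_{n_i}(b_{i-1}, b)(\tilde{\chi}_i) = \frac{1}{n_i} \sum_{j=1}^{n_i} (Z^2 L(b_{i-1})(2Gb + H) L(b))^{(b_{i-1})}(\beta_{i,j}), \tag{7.3}$$

$$\nabla_b^2 \widehat{\mathrm{msq}}_{n_i}(b_{i-1}, b)(\tilde{\chi}_i) = \frac{1}{n_i} \sum_{j=1}^{n_i} (Z^2 L(b_{i-1})(2G + (2Gb + H)(2Gb + H)^T) L(b))^{(b_{i-1})}(\beta_{i,j}), \tag{7.4}$$

$$\nabla_b \widehat{\mathrm{msq2}}_{n_i}(b_{i-1}, b)(\tilde{\chi}_i) = \nabla_b \widehat{\mathrm{msq}}_{n_i}(b_{i-1}, b)(\tilde{\chi}_i) \frac{1}{n_i} \sum_{j=1}^{n_i} \Big(\frac{L(b_{i-1})}{L(b)}\Big)^{(b_{i-1})}(\beta_{i,j}) - \widehat{\mathrm{msq}}_{n_i}(b_{i-1}, b)(\tilde{\chi}_i) \frac{1}{n_i} \sum_{j=1}^{n_i} \Big((2Gb + H) \frac{L(b_{i-1})}{L(b)}\Big)^{(b_{i-1})}(\beta_{i,j}), \tag{7.5}$$

and for

$$\widehat{\mathrm{var}}_{n_i}(b_{i-1}, b)(\tilde{\chi}_i) = \frac{n_i}{n_i - 1} \Big(\widehat{\mathrm{msq2}}_{n_i}(b_{i-1}, b)(\tilde{\chi}_i) - \Big(\frac{1}{n_i} \sum_{j=1}^{n_i} (Z L(b_{i-1}))^{(b_{i-1})}(\beta_{i,j})\Big)^2\Big), \tag{7.6}$$

$$\nabla_b \widehat{\mathrm{ic}}_{n_i}(b_{i-1}, b)(\tilde{\chi}_i) = \frac{1}{n_i - 1} \sum_{j=1}^{n_i} \Big(\frac{L(b_{i-1})}{L(b)} C\Big)^{(b_{i-1})}(\beta_{i,j})\, \nabla_b \widehat{\mathrm{msq2}}_{n_i}(b_{i-1}, b)(\tilde{\chi}_i) - \frac{1}{n_i} \sum_{j=1}^{n_i} \Big((2Gb + H) \frac{L(b_{i-1})}{L(b)} C\Big)^{(b_{i-1})}(\beta_{i,j})\, \widehat{\mathrm{var}}_{n_i}(b_{i-1}, b)(\tilde{\chi}_i). \tag{7.7}$$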

Formulas for the second derivatives of $\widehat{\mathrm{msq2}}_{n_i}(b_{i-1}, \cdot)$ and $\widehat{\mathrm{ic}}_{n_i}(b_{i-1}, \cdot)$ can also be easily computed and used in minimization algorithms, but we did not apply them in our experiments.
When evaluating the above expressions one can take advantage of formulas (5.41), (5.44), and
(5.45).
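To make the use of first and second derivatives of the mean square estimator concrete, here is a minimal sketch of Newton's method in a toy model that is an assumption of this illustration, not the setting of the experiments: one-dimensional Gaussian mean shift with $\ln L(b)(x) = b^2/2 - bx$, so that in the form $b^T G b + H b$ one has $G = 1/2$ and $H = -x$, and $Z = 1(x > c)$. The helper names `msq_grad_hess` and `minimize_msq` are hypothetical.

```python
import math
import random


def msq_grad_hess(b, b_prev, xs, c):
    """Gradient and Hessian of the mean square estimator in the assumed toy model.

    They are sample means of Z^2 L(b_prev) (2Gb+H) L(b) and of
    Z^2 L(b_prev) (2G + (2Gb+H)^2) L(b), with 2G = 1 and 2Gb + H = b - x.
    """
    grad = hess = 0.0
    for x in xs:
        if x > c:  # Z = 1(x > c), hence Z^2 = Z
            w = math.exp(-b_prev * x + 0.5 * b_prev ** 2)  # L(b_prev)(x)
            lb = math.exp(-b * x + 0.5 * b ** 2)           # L(b)(x)
            d = b - x                                      # 2Gb + H
            grad += w * d * lb
            hess += w * (1.0 + d * d) * lb
    n = len(xs)
    return grad / n, hess / n


def minimize_msq(b_prev, xs, c, iters=50):
    """Plain Newton iterations started at b_prev (no line search, toy use only)."""
    b = b_prev
    for _ in range(iters):
        grad, hess = msq_grad_hess(b, b_prev, xs, c)
        if hess == 0.0:  # no sample with Z != 0: nothing to minimize
            break
        b -= grad / hess
    return b
```

In this toy model the Hessian is positive whenever some sample has $Z \neq 0$, which mirrors the role of the positive definiteness assumptions in the text; a practical implementation would add a line search or use a library optimizer.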


7.2 Helper strong laws of large numbers 

In this section we provide various SLLNs needed further on. The following uniform SLLN is
well-known; see Theorem A1, Section 2.6 in [49].

Theorem 131. Let $Y$ be a random variable with values in a measurable space $\mathcal{S}$, let $V \subset \mathbb{R}^l$ be nonempty and compact, and let $h : \mathcal{S}(V) \otimes \mathcal{S} \to \mathcal{S}(\mathbb{R})$ be such that a.s. $x \to h(x, Y)$ is continuous and $\mathbb{E}(\sup_{x \in V} |h(x, Y)|) < \infty$. Then, for $Y_1, Y_2, \ldots$ i.i.d. $\sim Y$, a.s. as $n \to \infty$, $x \in V \to \frac{1}{n} \sum_{i=1}^{n} h(x, Y_i)$ converges uniformly to a continuous function $x \in V \to \mathbb{E}(h(x, Y))$.
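The uniform SLLN can be observed numerically in a small sketch. The choices below are assumptions made only for illustration: $Y \sim N(0, 1)$, $V = [-1, 1]$, and $h(x, y) = (y - x)^2$, for which the limit is $\mathbb{E}(h(x, Y)) = 1 + x^2$. The function name `sup_error` is hypothetical.

```python
import random


def sup_error(n, seed=0, grid=201):
    """Sup over a grid of V = [-1, 1] of |(1/n) sum_i h(x, Y_i) - E h(x, Y)|."""
    rng = random.Random(seed)
    ys = [rng.gauss(0.0, 1.0) for _ in range(n)]
    m1 = sum(ys) / n
    m2 = sum(y * y for y in ys) / n
    worst = 0.0
    for k in range(grid):
        x = -1.0 + 2.0 * k / (grid - 1)
        empirical = m2 - 2.0 * x * m1 + x * x  # equals (1/n) sum_i (Y_i - x)^2
        limit = 1.0 + x * x                    # E (Y - x)^2 for Y ~ N(0, 1)
        worst = max(worst, abs(empirical - limit))
    return worst
```

Since the error is affine in $x$ here, its supremum over $V$ is attained at an endpoint, which the grid contains, so the grid maximum equals the true supremum in this example.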

For each $p \in [1, \infty]$ and $\mathbb{R}$-valued random variable $U$, let $\|U\|_p$ denote the norm of $U$ in $L^p(\mathbb{P})$. We have the following well-known generalization of Hölder's inequality which follows from it by induction.

Lemma 132. Let $n \in \mathbb{N}_+$, let $U_i$, $i = 1, \ldots, n$, be $\mathbb{R}$-valued random variables, and let $q_i \in [1, \infty]$, $i = 1, \ldots, n$, be such that $\sum_{i=1}^{n} \frac{1}{q_i} = 1$. Then, it holds

$$\Big\|\prod_{i=1}^{n} U_i\Big\|_1 \le \prod_{i=1}^{n} \|U_i\|_{q_i}. \tag{7.8}$$
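The generalized Hölder inequality holds for any probability measure, in particular for the empirical measure of a finite sample, which gives a simple numerical check. The following sketch computes both sides of (7.8) for such an empirical measure; the function names `lp_norm` and `holder_sides` are hypothetical.

```python
import random


def lp_norm(values, q):
    """L^q norm with respect to the uniform (empirical) measure on the sample."""
    if q == float("inf"):
        return max(abs(v) for v in values)
    return (sum(abs(v) ** q for v in values) / len(values)) ** (1.0 / q)


def holder_sides(columns, qs):
    """Return (||prod_i U_i||_1, prod_i ||U_i||_{q_i}); columns[i][k] is U_i at point k."""
    n = len(columns[0])
    lhs = 0.0
    for k in range(n):
        prod = 1.0
        for col in columns:
            prod *= col[k]
        lhs += abs(prod)
    lhs /= n
    rhs = 1.0
    for col, q in zip(columns, qs):
        rhs *= lp_norm(col, q)
    return lhs, rhs
```

For instance, three Gaussian samples with exponents $q = (2, 3, 6)$, whose reciprocals sum to one, give a left-hand side not exceeding the right-hand side.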

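Let further in this section $n_i \in \mathbb{N}_+$, $i \in \mathbb{N}_+$, and $r \in \mathbb{N}_+$. To our knowledge, the SLLNs that follow are new.

Theorem 133. Let $M_i \ge 0$, $i \in \mathbb{N}_+$, be such that

$$\sum_{i=1}^{\infty} \frac{M_i}{n_i^r} < \infty. \tag{7.9}$$

Consider $\sigma$-fields $\mathcal{G}_i \subset \mathcal{F}$, $i \in \mathbb{N}_+$, and $\mathbb{R}$-valued random variables $\psi_{i,j}$, $j = 1, \ldots, n_i$, $i \in \mathbb{N}_+$, which are conditionally independent given $\mathcal{G}_i$ for the same $i$ and different $j$, and we have $\mathbb{E}(\psi_{i,j}^{2r}) \le M_i < \infty$ and $\mathbb{E}(\psi_{i,j} | \mathcal{G}_i) = 0$. Then, for $a_i = \frac{1}{n_i} \sum_{j=1}^{n_i} \psi_{i,j}$, $i \in \mathbb{N}_+$, we have a.s. $\lim_{n \to \infty} a_n = 0$.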


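Proof. From the Borel-Cantelli lemma it is sufficient to prove that for each $\epsilon > 0$

$$\sum_{i=1}^{\infty} \mathbb{P}(|a_i| > \epsilon) < \infty. \tag{7.10}$$

From Markov's inequality we have

$$\mathbb{P}(|a_i| > \epsilon) \le \frac{\mathbb{E}(a_i^{2r})}{\epsilon^{2r}}, \tag{7.11}$$

so that it is sufficient to prove that

$$\sum_{i=1}^{\infty} \mathbb{E}(a_i^{2r}) < \infty. \tag{7.12}$$

Let us consider separately the easiest to prove case of $r = 1$. We have for $i \in \mathbb{N}_+$, and $j, l \in \{1, \ldots, n_i\}$, $j \ne l$, from the conditional independence

$$\mathbb{E}(\psi_{i,j} \psi_{i,l} | \mathcal{G}_i) = \mathbb{E}(\psi_{i,j} | \mathcal{G}_i) \mathbb{E}(\psi_{i,l} | \mathcal{G}_i) = 0, \tag{7.13}$$

and thus

$$\mathbb{E}(\psi_{i,j} \psi_{i,l}) = \mathbb{E}(\mathbb{E}(\psi_{i,j} \psi_{i,l} | \mathcal{G}_i)) = 0. \tag{7.14}$$

Thus, for $i \in \mathbb{N}_+$

$$\mathbb{E}(a_i^2) = \frac{1}{n_i^2} \Big(\sum_{j=1}^{n_i} \mathbb{E}(\psi_{i,j}^2) + 2 \sum_{j < l \in \{1, \ldots, n_i\}} \mathbb{E}(\psi_{i,j} \psi_{i,l})\Big) \le \frac{M_i}{n_i}. \tag{7.15}$$

Now, (7.12) follows from (7.9).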


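For general $r \in \mathbb{N}_+$, denoting $J_i = \{v \in \mathbb{N}^{n_i} : \sum_{j=1}^{n_i} v_j = 2r\}$ and for $v \in J_i$, $\binom{2r}{v} = \frac{(2r)!}{\prod_{j=1}^{n_i} v_j!}$, we have for $v \in J_i$, from Lemma 132 for $n = n_i$, $U_j = \psi_{i,j}^{v_j}$ and $\frac{1}{q_j} = \frac{v_j}{2r}$,

$$\mathbb{E}\Big(\Big|\prod_{j=1}^{n_i} \psi_{i,j}^{v_j}\Big|\Big) \le \prod_{j=1}^{n_i} (\mathbb{E}(\psi_{i,j}^{2r}))^{\frac{v_j}{2r}} \le M_i. \tag{7.16}$$

Thus,

$$\sum_{v \in J_i} \binom{2r}{v} \mathbb{E}\Big(\Big|\prod_{j=1}^{n_i} \psi_{i,j}^{v_j}\Big|\Big) < \infty. \tag{7.17}$$

For $v \in J_i$ such that $v_k = 1$ for some $k \in \{1, \ldots, n_i\}$, denoting $\psi_{i,-k} = \prod_{j \in \{1, \ldots, n_i\},\ j \ne k} \psi_{i,j}^{v_j}$, we have that $\psi_{i,k}$ and $\psi_{i,-k}$ are conditionally independent given $\mathcal{G}_i$. Furthermore, from Lemma 132 for $n = n_i$, $U_j = \psi_{i,j}^{v_j}$ for $j \ne k$, $U_k = 1$, and $\frac{1}{q_j} = \frac{v_j}{2r}$, we have

$$\mathbb{E}(|\psi_{i,-k}|) \le \prod_{j \in \{1, \ldots, n_i\},\ j \ne k} (\mathbb{E}(\psi_{i,j}^{2r}))^{\frac{v_j}{2r}} < \infty. \tag{7.18}$$

Thus,

$$\mathbb{E}\Big(\prod_{j=1}^{n_i} \psi_{i,j}^{v_j} \Big| \mathcal{G}_i\Big) = \mathbb{E}(\psi_{i,k} | \mathcal{G}_i) \mathbb{E}(\psi_{i,-k} | \mathcal{G}_i) = 0, \tag{7.19}$$

and $\mathbb{E}(\prod_{j=1}^{n_i} \psi_{i,j}^{v_j}) = 0$. Therefore, for $\tilde{J}_i = \{v \in (\mathbb{N} \setminus \{1\})^{n_i} : \sum_{j=1}^{n_i} v_j = 2r\}$ we have

$$\mathbb{E}(a_i^{2r}) = \frac{1}{n_i^{2r}} \sum_{v \in \tilde{J}_i} \binom{2r}{v} \mathbb{E}\Big(\prod_{j=1}^{n_i} \psi_{i,j}^{v_j}\Big). \tag{7.20}$$

Note that for $v \in \tilde{J}_i$ and $p(v) := |\{j \in \{1, \ldots, n_i\} : v_j \ne 0\}|$, it holds $p(v) \le r$, and thus for $J'_i = \{v \in \{0, 2, 3, \ldots, 2r\}^{n_i} : p(v) \le r\}$, we have $\tilde{J}_i \subset J'_i$. Therefore,

$$|\tilde{J}_i| \le |J'_i| \le \binom{n_i}{r} (2r)^r \le n_i^r (2r)^r. \tag{7.21}$$

Furthermore,

$$\binom{2r}{v} \le (2r)!. \tag{7.22}$$

From (7.20), (7.16), (7.21), and (7.22)

$$\mathbb{E}(a_i^{2r}) \le \frac{M_i}{n_i^r} (2r)! (2r)^r. \tag{7.23}$$

Inequality (7.12) follows from (7.23) and (7.9). □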

Let $l \in \mathbb{N}_+$, let $A \in \mathcal{B}(\mathbb{R}^l)$ be nonempty, and let a family of probability distributions $\mathbb{Q}(b)$, $b \in A$, be as in Section 3.4. Let $b_i$, $i \in \mathbb{N}$, be $A$-valued random variables.

Condition 134. Nonempty sets $K_i \in \mathcal{B}(A)$, $i \in \mathbb{N}_+$, are such that a.s. for a sufficiently large $i$, $b_i \in K_i$.

Condition 135. For each $i \in \mathbb{N}_+$, $\chi_{i,j}$, $j = 1, \ldots, n_i$, are conditionally independent given $b_{i-1}$ and have conditional distribution $\mathbb{Q}(v)$ given $b_{i-1} = v$ (see page 420 in [18] or page 15 in [29] for a definition of a conditional distribution). It holds $\tilde{\chi}_i = (\chi_{i,j})_{j=1}^{n_i}$, $i \in \mathbb{N}_+$.

Condition 135 is implied by the following one.

Condition 136. Condition 19 holds and for each $i \in \mathbb{N}_+$, $\beta_{i,j}$, $j = 1, \ldots, n_i$, are independent and independent of $b_{i-1}$. Furthermore, $\chi_{i,j} = \xi(\beta_{i,j}, b_{i-1})$, $j = 1, \ldots, n_i$, and $\tilde{\chi}_i = (\chi_{i,j})_{j=1}^{n_i}$, $i \in \mathbb{N}_+$.

Let us further in this section assume conditions 134 and 135.

Condition 137. A function $h : \mathcal{S}(A) \otimes \mathcal{S}_1 \to \mathcal{S}(\mathbb{R})$ is such that for each $v \in A$, $\mathbb{E}_{\mathbb{Q}(v)}(h(v, \cdot)) = 0$, and for

$$M_i = \sup_{w \in K_{i-1}} \mathbb{E}_{\mathbb{Q}(w)}(h(w, \cdot)^{2r}), \quad i \in \mathbb{N}_+, \tag{7.24}$$

(7.9) holds.

Theorem 138. Under Condition 137, for

$$\hat{b}_i = \frac{1}{n_i} \sum_{k=1}^{n_i} h(b_{i-1}, \chi_{i,k}), \quad i \in \mathbb{N}_+, \tag{7.25}$$

we have a.s.

$$\lim_{i \to \infty} \hat{b}_i = 0. \tag{7.26}$$

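Proof. Let for $i \in \mathbb{N}_+$, $h_i : A \times \Omega_1 \to \mathbb{R}$ be such that for each $x \in \Omega_1$, $h_i(v, x) = h(v, x)$ when $v \in K_{i-1}$ and $h_i(v, x) = 0$ when $v \in A \setminus K_{i-1}$. For

$$a_i = \frac{1}{n_i} \sum_{k=1}^{n_i} h_i(b_{i-1}, \chi_{i,k}), \tag{7.27}$$

from Condition 134 we have a.s. $\hat{b}_i - a_i = 1(b_{i-1} \notin K_{i-1}) \hat{b}_i \to 0$ as $i \to \infty$. Thus, to prove (7.26) it is sufficient to prove that a.s.

$$\lim_{i \to \infty} a_i = 0. \tag{7.28}$$

Let $\psi_{i,j} = h_i(b_{i-1}, \chi_{i,j})$, $i \in \mathbb{N}_+$, $j = 1, \ldots, n_i$. From the conditional Fubini's theorem (see Theorem 2, Section 22.1 in [18])

$$\mathbb{E}(\psi_{i,j}^{2r}) = \mathbb{E}((\mathbb{E}_{\mathbb{Q}(v)}(h_i^{2r}(v, \cdot)))_{v = b_{i-1}}) = \mathbb{E}((1(v \in K_{i-1}) \mathbb{E}_{\mathbb{Q}(v)}(h^{2r}(v, \cdot)))_{v = b_{i-1}}) \le M_i. \tag{7.29}$$

Furthermore, $\psi_{i,j}$, $j = 1, \ldots, n_i$, are conditionally independent given $\mathcal{G}_i := \sigma(b_{i-1})$, and from some well-known properties of conditional distributions (see Definition 1, Section 23.1 in [18]), we have

$$\mathbb{E}(\psi_{i,j} | \mathcal{G}_i) = (\mathbb{E}_{\mathbb{Q}(v)}(h_i(v, \cdot)))_{v = b_{i-1}} = (1(v \in K_{i-1}) \mathbb{E}_{\mathbb{Q}(v)}(h(v, \cdot)))_{v = b_{i-1}} = 0. \tag{7.30}$$

Thus, (7.28) follows from Theorem 133. □

Theorem 139. If $g$ is such that $f(v) := \mathbb{E}_{\mathbb{Q}(v)}(g(v, \cdot)) \in \mathbb{R}$, $v \in A$, and for

$$P_i = \sup_{v \in K_{i-1}} \mathbb{E}_{\mathbb{Q}(v)}(g(v, \cdot)^{2r}), \quad i \in \mathbb{N}_+, \tag{7.31}$$

we have

$$\sum_{i=1}^{\infty} \frac{P_i}{n_i^r} < \infty, \tag{7.32}$$

then Condition 137 holds for $h(v, y) = g(v, y) - f(v)$, $v \in A$, $y \in \Omega_1$.

Proof. Clearly, $\mathbb{E}_{\mathbb{Q}(v)}(h(v, \cdot)) = 0$, $v \in A$. Furthermore, for $v \in A$

$$\mathbb{E}_{\mathbb{Q}(v)}(h(v, \cdot)^{2r}) \le \mathbb{E}_{\mathbb{Q}(v)}((|g(v, \cdot)| + |f(v)|)^{2r}) \le 2^{2r-1} \mathbb{E}_{\mathbb{Q}(v)}(g(v, \cdot)^{2r} + f(v)^{2r}) \le 4^r \mathbb{E}_{\mathbb{Q}(v)}(g(v, \cdot)^{2r}), \tag{7.33}$$

where in the second inequality we used the fact that $(a + b)^p \le 2^{p-1}(a^p + b^p)$, $a, b \in [0, \infty)$, $p \in [1, \infty)$, and in the last inequality we used conditional Jensen's inequality. Thus, $M_i \le 4^r P_i$ and (7.9) follows from (7.32). □

Condition 140. Condition 17 holds for $Z$ replaced by some $S \in L^1(\mathbb{Q}_1)$ and for

$$P_i = \sup_{v \in K_{i-1}} \mathbb{E}_{\mathbb{Q}(v)}((S L(v))^{2r}) = \sup_{v \in K_{i-1}} \mathbb{E}_{\mathbb{Q}_1}(S^{2r} L(v)^{2r-1}), \quad i \in \mathbb{N}_+, \tag{7.34}$$

we have (7.32).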

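Theorem 141. Under Condition 140, a.s.

$$\lim_{k \to \infty} \overline{(S L(b_{k-1}))}_{n_k}(\tilde{\chi}_k) = \mathbb{E}_{\mathbb{Q}_1}(S). \tag{7.35}$$

Proof. This follows from Theorem 139 for $g(v, y) = (S L(v))(y)$, $v \in A$, $y \in \Omega_1$, in which $f(v) = \mathbb{E}_{\mathbb{Q}_1}(S)$, $v \in A$, as well as from Theorem 138. □

For each $\mathbb{R}$-valued random variable $Y$ on $\mathcal{S}_1$ and $q \ge 1$, let $\|Y\|_q = \mathbb{E}_{\mathbb{Q}_1}(|Y|^q)^{\frac{1}{q}}$.

Lemma 142. Let $p, q \in [1, \infty]$ be such that $\frac{1}{p} + \frac{1}{q} = 1$, let $S \in L^{2rp}(\mathbb{Q}_1)$, let Condition 17 hold for $Z = S$, and let for

$$R_i = \sup_{v \in K_{i-1}} \|1(S \ne 0) L(v)^{2r-1}\|_q, \quad i \in \mathbb{N}_+, \tag{7.36}$$

it hold

$$\sum_{i=1}^{\infty} \frac{R_i}{n_i^r} < \infty. \tag{7.37}$$

Then, Condition 140 holds.

Proof. From Hölder's inequality $\mathbb{E}_{\mathbb{Q}_1}(S^{2r} L(v)^{2r-1}) \le \|S^{2r}\|_p \|1(S \ne 0) L(v)^{2r-1}\|_q$, so that for $P_i$ as in (7.34) we have $P_i \le \|S^{2r}\|_p R_i$. Thus, from (7.37), (7.32) holds for such $P_i$. □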

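The following uniform SLLN can be thought of as a multi-stage version of Theorem 131 and some arguments in its proof below are analogous to those in the proof of the latter in Theorem A1, Section 2.6 in [49].

Theorem 143. Let $V \subset \mathbb{R}^l$ be a nonempty compact set and let $h : \mathcal{S}_1 \otimes \mathcal{S}(V) \to \mathcal{S}(\mathbb{R})$ be such that for $\mathbb{Q}_1$ a.e. $\omega \in \Omega_1$, $b \to h(\omega, b)$ is continuous. Let $Y(\omega) = \sup_{b \in V} |h(\omega, b)|$, $\omega \in \Omega_1$, and let Condition 140 hold for $S = Y$. Then, a.s. as $k \to \infty$, $b \in V \to \hat{a}_k(b) := \frac{1}{n_k} \sum_{i=1}^{n_k} h(\chi_{k,i}, b) L(b_{k-1})(\chi_{k,i})$ converges uniformly to a continuous function $b \in V \to a(b) := \mathbb{E}_{\mathbb{Q}_1}(h(\cdot, b)) \in \mathbb{R}$.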

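Proof. Obviously,

$$|h(\omega, b)| \le Y(\omega), \quad \omega \in \Omega_1,\ b \in V, \tag{7.38}$$

and for each $b \in K_0$, for $P_1$ as in (7.34) for $S = Y$,

$$\mathbb{E}_{\mathbb{Q}_1}(Y) = \mathbb{E}_{\mathbb{Q}(b)}(Y L(b)) \le (\mathbb{E}_{\mathbb{Q}(b)}((Y L(b))^{2r}))^{\frac{1}{2r}} \le P_1^{\frac{1}{2r}} < \infty. \tag{7.39}$$

Thus, for each $v \in V$ and $v_k \in V$, $k \in \mathbb{N}_+$, such that $\lim_{k \to \infty} v_k = v$, from Lebesgue's dominated convergence theorem and $\mathbb{Q}_1$ a.s. continuity of $b \to h(\cdot, b)$,

$$\lim_{k \to \infty} a(v_k) = \mathbb{E}_{\mathbb{Q}_1}(\lim_{k \to \infty} h(\cdot, v_k)) = a(v) \in \mathbb{R}. \tag{7.40}$$

Thus, $a$ is finite and continuous on $V$. Let $\epsilon > 0$. From the uniform continuity of $a$ on $V$, let $\delta > 0$ be such that

$$|a(x) - a(y)| < \epsilon, \quad x, y \in V,\ |x - y| < \delta. \tag{7.41}$$

For each $y \in V$ and $n \in \mathbb{N}_+$, let $B_{n,y} = \{x \in V : |x - y| < \frac{1}{n}\}$, and let for each $\omega \in \Omega_1$

$$r_{n,y}(\omega) = \sup\{|h(\omega, x) - h(\omega, y)| : x \in B_{n,y}\}. \tag{7.42}$$

For $\mathbb{Q}_1$ a.e. $\omega$ for which $h(\omega, \cdot)$ is continuous, $\lim_{n \to \infty} r_{n,y}(\omega) = 0$, $y \in V$. Furthermore,

$$r_{n,y}(\omega) \le 2 Y(\omega), \quad \omega \in \Omega_1,\ n \in \mathbb{N}_+,\ y \in V, \tag{7.43}$$

so that from Lebesgue's dominated convergence theorem

$$\lim_{n \to \infty} \mathbb{E}_{\mathbb{Q}_1}(r_{n,y}) = \mathbb{E}_{\mathbb{Q}_1}(\lim_{n \to \infty} r_{n,y}) = 0, \quad y \in V. \tag{7.44}$$

Thus, for each $y \in V$ there exists $n_y \in \mathbb{N}_+$, $n_y > \frac{1}{\delta}$, such that

$$\mathbb{E}_{\mathbb{Q}_1}(r_{n_y,y}) < \epsilon, \tag{7.45}$$

for which let us denote $W_y = B_{n_y,y}$. For each $x, y \in V$

$$|\hat{a}_k(x) - \hat{a}_k(y)| \le \frac{1}{n_k} \sum_{i=1}^{n_k} L(b_{k-1})(\chi_{k,i}) |h(\chi_{k,i}, x) - h(\chi_{k,i}, y)|, \tag{7.46}$$

so that for each $y \in V$

$$\sup_{x \in W_y} |\hat{a}_k(x) - \hat{a}_k(y)| \le \frac{1}{n_k} \sum_{i=1}^{n_k} L(b_{k-1})(\chi_{k,i}) r_{n_y,y}(\chi_{k,i}). \tag{7.47}$$

From (7.43), Condition 140 holds for $S = r_{n_y,y}$, so that from Theorem 141, the right-hand side of (7.47) converges a.s. to $\mathbb{E}_{\mathbb{Q}_1}(r_{n_y,y})$ as $k \to \infty$. Thus, from (7.45), for each $y \in V$, a.s. for a sufficiently large $k$,

$$\sup_{x \in W_y} |\hat{a}_k(x) - \hat{a}_k(y)| < \epsilon. \tag{7.48}$$

The family $\{W_y, y \in V\}$ is a cover of $V$. From the compactness of $V$ there exists a finite set of points $y_1, \ldots, y_m \in V$ such that $\{W_{y_i} : i = 1, \ldots, m\}$ is a cover of $V$, and a.s. for a sufficiently large $k$ we have

$$\sup_{x \in W_{y_i}} |\hat{a}_k(x) - \hat{a}_k(y_i)| < \epsilon, \quad i = 1, \ldots, m. \tag{7.49}$$

From (7.38), for each $x \in V$, Condition 140 holds for $S = h(\cdot, x)$, so that from Theorem 141, for each $x \in V$, a.s. $\lim_{k \to \infty} \hat{a}_k(x) = a(x)$. Thus, a.s. for a sufficiently large $k$

$$|\hat{a}_k(y_i) - a(y_i)| < \epsilon, \quad i = 1, \ldots, m. \tag{7.50}$$

Therefore, a.s. for a sufficiently large $k$ for which (7.49) and (7.50) hold, for each $x \in V$, for some $i \in \{1, \ldots, m\}$ such that $x \in W_{y_i}$, and thus $|y_i - x| < \delta$,

$$|\hat{a}_k(x) - a(x)| \le |\hat{a}_k(x) - \hat{a}_k(y_i)| + |\hat{a}_k(y_i) - a(y_i)| + |a(y_i) - a(x)| < 3\epsilon. \tag{7.51}$$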

□ 


7.3 Locally uniform convergence of estimators 

In this section we apply the SLLNs from the previous section to provide sufficient conditions for the single- and multi-stage a.s. locally uniform convergence of various estimators from Section 4.2 as well as their derivatives to the corresponding functions and their derivatives. Such a convergence will be needed when proving the convergence and asymptotic properties of the minimization results of these estimators in the further sections. By $\rightrightarrows$ we denote uniform convergence. For some $A \subset \mathbb{R}^l$, we say that functions $f_n : A \to \mathbb{R}$, $n \in \mathbb{N}_+$, converge locally uniformly to some function $f : A \to \mathbb{R}$, which we denote as $f_n \overset{loc}{\rightrightarrows} f$, if for each compact set $K \subset A$, $f_{n|K} \rightrightarrows f_{|K}$, i.e. $f_n$ converges to $f$ uniformly on $K$.
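The difference between uniform and locally uniform convergence can be seen on a standard example that is not taken from this work: $f_n(x) = x / n$ on $A = \mathbb{R}$ converges to $f = 0$ uniformly on every compact interval $[-M, M]$, with supremum $M / n$, but not uniformly on $\mathbb{R}$. A minimal sketch, with the hypothetical helper name `sup_on_interval`:

```python
def sup_on_interval(n, radius, grid=1001):
    """Sup over a grid of [-radius, radius] of |f_n(x) - f(x)| for f_n(x) = x/n, f = 0."""
    worst = 0.0
    for k in range(grid):
        x = -radius + 2.0 * radius * k / (grid - 1)
        worst = max(worst, abs(x / n))
    return worst
```

For a fixed radius the returned value tends to zero as `n` grows, while for a fixed `n` it grows without bound in the radius, which is exactly the failure of uniform convergence on the whole of $\mathbb{R}$.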

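Lemma 144. Let $l, m \in \mathbb{N}_+$, let $D \subset \mathbb{R}^l$ be nonempty and compact, let functions $f : D \to \mathbb{R}^m$ and $s : \mathbb{R}^m \to \mathbb{R}$ be continuous, and for some $f_n : D \to \mathbb{R}^m$, $n \in \mathbb{N}_+$, let $f_n \rightrightarrows f$. Then, $s(f_n) \rightrightarrows s(f)$. If further $s_n : \mathbb{R}^m \to \mathbb{R}$, $n \in \mathbb{N}_+$, are such that $s_n \overset{loc}{\rightrightarrows} s$, then $s_n(f_n) \rightrightarrows s(f)$.

Proof. For $M = \sup_{x \in D} |f(x)| < \infty$ let $K = \overline{B}_m(0, M + 1)$, and let $\epsilon > 0$. Since $s$ is uniformly continuous on $K$, let us choose $0 < \delta < 1$ such that $|s(x) - s(y)| < \epsilon$ when $|x - y| < \delta$, $x, y \in K$. Let $N \in \mathbb{N}_+$ be such that for $n \ge N$, $|f_n(x) - f(x)| < \delta$, $x \in D$. Then, for $n \ge N$ we have $|s(f_n(x)) - s(f(x))| < \epsilon$, $x \in D$. Let further $M' \in \mathbb{N}_+$, $M' \ge N$, be such that for $n \ge M'$, $|s_n(y) - s(y)| < \epsilon$, $y \in K$. Then, for $n \ge M'$ and $x \in D$

$$|s_n(f_n(x)) - s(f(x))| \le |s_n(f_n(x)) - s(f_n(x))| + |s(f_n(x)) - s(f(x))| < 2\epsilon. \tag{7.52}$$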

□ 

Until dealing with the cross-entropy estimators at the end of this section, we shall consider
the LETS setting. As in Section 6.11, this will allow us to cover the special case of
the LETGS setting, and it is straightforward to modify the theory below to deal with the ECM
setting for $A = \mathbb{R}^l$.

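Theorem 145. Assuming Condition 32, if Condition 115 holds

1. for $S = 1$, then a.s. (as $n \to \infty$) $b \to \overline{(L' \partial_v(L^{-1})(b))}_n(\tilde{\kappa}_n)$ converges locally uniformly to 0 for $v \in \mathbb{N}^l \setminus \{0\}$ and to 1 for $v = 0$,

2. for $S = Z^2$, then a.s. $b \to \partial_v \widehat{\mathrm{msq}}_n(b', b)(\tilde{\kappa}_n) = \overline{(Z^2 L' \partial_v L(b))}_n(\tilde{\kappa}_n) \overset{loc}{\rightrightarrows} \partial_v \mathrm{msq}$, $v \in \mathbb{N}^l$,

3. for $S = C$, then a.s. $b \to \partial_v \hat{c}_n(b', b)(\tilde{\kappa}_n) \overset{loc}{\rightrightarrows} \partial_v c$, $v \in \mathbb{N}^l$,

4. both for $S$ equal to $Z^2$ and 1, then a.s. $b \to \partial_v \widehat{\mathrm{msq2}}_n(b', b)(\tilde{\kappa}_n) \overset{loc}{\rightrightarrows} \partial_v \mathrm{msq}$ and $b \to \partial_v \widehat{\mathrm{var}}_n(b', b)(\tilde{\kappa}_n) \overset{loc}{\rightrightarrows} \partial_v \mathrm{var}$, $v \in \mathbb{N}^l$,

5. for $S = C$, $S = Z^2$, and $S = 1$, then a.s. $b \to \partial_v \widehat{\mathrm{ic}}_n(b', b)(\tilde{\kappa}_n) \overset{loc}{\rightrightarrows} \partial_v \mathrm{ic}$, $v \in \mathbb{N}^l$.

Proof. The first three points follow from such points of Theorem 123, Theorem 121 for $p_2 = 1$ and appropriate $p_1$, Remark 119, and from Theorem 131 (note that from Condition 115 for $S = C$ we have such a condition for $S = 1(C \ne \infty) C$). The fourth point follows from the first two points, the fact that a.s. $\overline{(Z L')}_n(\tilde{\kappa}_n) \to \alpha$, the last line in (4.23), (4.24), and Lemma 144. The fifth point follows from points three, four, and Lemma 144. □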

Let us further in this section assume the following condition. 

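Condition 146. $A = \mathbb{R}^l$, $r \in \mathbb{N}_+$, for each $i \in \mathbb{N}_+$, $n_i \in \mathbb{N}_+$, and for each $i \in \mathbb{N}$, $L_i \in [0, \infty)$ and $K_i = \{b \in \mathbb{R}^l : |b| \le L_i\}$.

Consider the following conditions.

Condition 147. For each $a_1, a_2 \in \mathbb{R}_+$

$$\sum_{i=1}^{\infty} \frac{\exp(a_1 F(a_2 L_{i-1}))}{n_i^r} < \infty. \tag{7.53}$$

Condition 148. $\lim_{i \to \infty} L_i = \infty$.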

Remark 149. Let us discuss possible choices of $n_i$ and $L_i$ such that conditions 147 and 148 hold for each $r \in \mathbb{N}_+$, in some special cases of the LETS setting. Let $A_1 \in \mathbb{N}$, $A_2 \in \mathbb{N}_+$, $m \in \mathbb{N}_2$, $0 < \delta < 1$, and $B_1, B_2 \in \mathbb{R}_+$. Consider $F(x) = \frac{x^2}{2}$, which corresponds to $X$ having multivariate standard normal distribution under $\mathbb{U}$ (see sections 5.1 and 5.3.4). Then, one can take $n_i = A_1 + A_2 m^i$ and $L_i = (B_1 + B_2 (i + 1)^{\delta})^{\frac{1}{2}}$, or alternatively $n_i = A_1 + A_2 i!$ and $L_i = (B_1 + B_2 (i + 1))^{\frac{1}{2}}$. For some $a_1, a_2 \in \mathbb{R}_+$, denoting $b_i = \frac{\exp(a_1 F(a_2 L_{i-1}))}{n_i^r}$, in the first case we have $\lim_{i \to \infty} b_i^{\frac{1}{i}} = \frac{1}{m^r} < 1$ and in the second case, using Stirling's formula, we have $\lim_{i \to \infty} b_i^{\frac{1}{i}} = 0$. Thus, in both cases (7.53) follows from Cauchy's criterion. For $F(x) = \mu_0 (\exp(x) - 1)$, which corresponds to the Poisson case with initial mean $\mu_0$, one can take e.g. $L_i = B_1 \ln(B_2 + \ln(i + 1))$ and some $n_i$ as for the normal case above.
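The root test behind such choices can be checked numerically. The sketch below rests on assumptions that should be read as such: $F(x) = x^2 / 2$, $n_i = A_1 + A_2 m^i$, and $L_i = (B_1 + B_2 (i + 1)^{\delta})^{1/2}$ with $0 < \delta < 1$, the last being one reading of the damaged formula in the source. It works with logarithms and exact integers to avoid floating-point overflow; the names `log_term` and `root` are hypothetical.

```python
import math


def log_term(i, r=1, a1=1.0, a2=1.0, A1=0, A2=1, m=2, B1=1.0, B2=1.0, delta=0.5):
    """Logarithm of exp(a1 * F(a2 * L_{i-1})) / n_i**r for the assumed F, n_i, L_i."""
    # L_{i-1} = (B1 + B2 * ((i - 1) + 1)**delta)**0.5
    l_prev = math.sqrt(B1 + B2 * i ** delta)
    # n_i is kept as an exact Python integer; math.log accepts large integers.
    n_i = A1 + A2 * m ** i
    return a1 * 0.5 * (a2 * l_prev) ** 2 - r * math.log(n_i)


def root(i, **kwargs):
    """The i-th root of the i-th term, as used in Cauchy's (root) criterion."""
    return math.exp(log_term(i, **kwargs) / i)
```

With the default parameters, `root(i)` approaches $1 / m^r = 0.5$ slowly as `i` grows, in line with the limit stated for the first choice of $n_i$ and $L_i$.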

Lemma 150. If conditions 68 and 69 hold, then for each $p \in [0, \infty)$ and $b \in \mathbb{R}^l$

$$\mathbb{E}_{\mathbb{Q}_1}(1(S \ne 0) L(b)^p) \le \exp(s (p F(R|b|) + F(Rp|b|))). \tag{7.54}$$

In particular, if $p \ge 1$ then

$$\mathbb{E}_{\mathbb{Q}_1}(1(S \ne 0) L(b)^p) \le \exp(2 p s F(Rp|b|)). \tag{7.55}$$

Proof. From (5.42) we have $\mathbb{Q}_1$ a.s. that if $S \ne 0$ (and thus from Condition 69, $\tau \le s$) then
L[b)^ = expipiUib) + Hb)) 

= expipmb) + m-bp))j^^ (7.56) 

< expis[pF{R\b\) + FiRp\b\)))—^——. 

U-bp) 

Now (7.54) follows from E(Q)j (11(5 7 ^ 0 )^ 3 ^) = Q(-hp)(S 7 ^ 0) < 1. □ 

Condition 151. Conditions 68 and 69 hold and for some $p \in (1, \infty)$, $S \in L^{2rp}(\mathbb{Q}_1)$.

Theorem 152. If conditions 147 and 151 hold, then Condition 140 holds (for the same $S$, $p$, $r$, $K_i$, and $n_i$ as in these conditions and Condition 146).

Proof. From Lemma 142 it is sufficient to check that for $q$ as in that lemma corresponding to $p$ from Condition 151, and for $R_i$ as in (7.36), we have (7.37). From (7.55) in Lemma 150, it holds

$$R_i = \sup_{b \in K_{i-1}} \|1(S \ne 0) L(b)^{2r-1}\|_q \le \sup_{b \in K_{i-1}} \exp(2s(2r - 1) F(Rq(2r - 1)|b|)) \le \exp(2s(2r - 1) F(Rq(2r - 1) L_{i-1})), \tag{7.57}$$

so that (7.37) follows from Condition 147 for $a_1 = 2s(2r - 1)$ and $a_2 = Rq(2r - 1)$. □

Theorem 153. If conditions 134, 135, and 147 hold and Condition 151 holds for $S = U$ (that is for $S$ denoted as $U$), then for each $w \in \mathbb{R}$ and $v \in \mathbb{N}^l$, a.s. as $k \to \infty$, $b \in \mathbb{R}^l \to \hat{f}_k(b) := \overline{(U L(b_{k-1}) \partial_v(L(b)^w))}_{n_k}(\tilde{\chi}_k)$ converges locally uniformly to $b \to f(b) := \mathbb{E}_{\mathbb{Q}_1}(U \partial_v(L(b)^w)) \in \mathbb{R}$.

Proof. Let $M \in \mathbb{R}_+$, $V = \{x \in \mathbb{R}^l : |x| \le M\}$, $h(\omega, b) = U(\omega) \partial_v(L(b)^w)(\omega)$, $\omega \in \Omega_1$, $b \in V$, and $W(\omega) = \sup_{b \in V} |h(\omega, b)|$, $\omega \in \Omega_1$. For some $1 < p' < p$, from Remark 117 for $S = U^{2rp'}$ and $q = \frac{p}{p'}$, Condition 115 holds for such an $S$. Thus, from Theorem 121 for $p_1 = w$ and $p_2 = 2rp'$ and from Remark 119, we have

$$\mathbb{E}_{\mathbb{Q}_1}(W^{2rp'}) = \mathbb{E}_{\mathbb{Q}_1}\Big(\sup_{|b| \le M} |U \partial_v(L(b)^w)|^{2rp'}\Big) < \infty. \tag{7.58}$$

Furthermore, if $W \ne 0$ then also $U \ne 0$, so that Condition 151 holds for $S = W$ and $p = p'$. Thus, from theorems 143 and 152 we obtain that a.s. $b \to \hat{f}_k(b)$ converges to $f$ uniformly on $V$. □

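Theorem 154. Let conditions 134, 135, and 147 hold. If Condition 151 holds

1. then a.s. $\overline{(L(b_{k-1}) S)}_{n_k}(\tilde{\chi}_k)$ converges to $\mathbb{E}_{\mathbb{Q}_1}(S)$ (as $k \to \infty$),

2. for $S = 1$, then a.s. $b \to \overline{(L(b_{k-1}) \partial_v(L(b)^{-1}))}_{n_k}(\tilde{\chi}_k)$ converges locally uniformly to 0 for $v \in \mathbb{N}^l \setminus \{0\}$ and to 1 for $v = 0$,

3. for $S = Z^2$, then a.s. $b \to \partial_v \widehat{\mathrm{msq}}_{n_k}(b_{k-1}, b)(\tilde{\chi}_k) \overset{loc}{\rightrightarrows} \partial_v \mathrm{msq}$, $v \in \mathbb{N}^l$,

4. for $S = C$, then a.s. $b \to \partial_v \hat{c}_{n_k}(b_{k-1}, b)(\tilde{\chi}_k) \overset{loc}{\rightrightarrows} \partial_v c$, $v \in \mathbb{N}^l$,

5. both for $S = 1$ and $S = Z^2$, and if $n_k \in \mathbb{N}_2$, $k \in \mathbb{N}_+$, then a.s. $b \to \partial_v \widehat{\mathrm{msq2}}_{n_k}(b_{k-1}, b)(\tilde{\chi}_k) \overset{loc}{\rightrightarrows} \partial_v \mathrm{msq}$ and $b \to \partial_v \widehat{\mathrm{var}}_{n_k}(b_{k-1}, b)(\tilde{\chi}_k) \overset{loc}{\rightrightarrows} \partial_v \mathrm{var}$, $v \in \mathbb{N}^l$,

6. for $S = C$, $S = Z^2$, and $S = 1$, and if $n_k \in \mathbb{N}_2$, $k \in \mathbb{N}_+$, then a.s. $b \to \partial_v \widehat{\mathrm{ic}}_{n_k}(b_{k-1}, b)(\tilde{\chi}_k) \overset{loc}{\rightrightarrows} \partial_v \mathrm{ic}$, $v \in \mathbb{N}^l$.

Proof. The first point follows directly from Theorem 153 for $v = 0$ and $w = 0$. Points two to four follow from Theorem 153 and points one to three of Theorem 123 (note that from Condition 151 for $S = C$ we have such a condition for $S = 1(C \ne \infty) C$). The fifth point follows from point one for $S = Z$ as well as points two, three, and Lemma 144, similarly as in the proof of the fourth point of Theorem 145. The sixth point follows from points four, five, and Lemma 144. □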

Let us now discuss single- and multi-stage locally uniform convergence of the cross-entropy 
estimators, for which we shall consider the ECM and LETGS settings separately. 

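Theorem 155. In the ECM setting, let us assume Condition 32 and that we have (6.8). Then, a.s.

$$\overline{(Z L')}_n(\tilde{\kappa}_n) \to \alpha \tag{7.59}$$

and

$$\overline{(Z L' X)}_n(\tilde{\kappa}_n) \to \mathbb{E}_{\mathbb{Q}_1}(Z X). \tag{7.60}$$

Assuming further Condition 36, we have a.s.

$$b \to \partial_v \widehat{\mathrm{ce}}_n(b', b)(\tilde{\kappa}_n) \overset{loc}{\rightrightarrows} \partial_v \mathrm{ce}, \quad v \in \mathbb{N}^l. \tag{7.61}$$

Proof. Formulas (7.59) and (7.60) follow from the SLLN. Under Condition 36, from (6.1) and (6.9) we have for $v \in \mathbb{N}^l$

$$\partial_v \mathrm{ce}(b) - \partial_v \widehat{\mathrm{ce}}_n(b', b)(\tilde{\kappa}_n) = \partial_v \Psi(b)(\alpha - \overline{(Z L')}_n(\tilde{\kappa}_n)) - (\partial_v b^T)(\mathbb{E}_{\mathbb{Q}_1}(Z X) - \overline{(Z L' X)}_n(\tilde{\kappa}_n)). \tag{7.62}$$

Thus, for each compact $K \subset A$, from (7.59) and (7.60),

$$\sup_{b \in K} |\partial_v \mathrm{ce}(b) - \partial_v \widehat{\mathrm{ce}}_n(b', b)(\tilde{\kappa}_n)| \le \sup_{b \in K} |\partial_v \Psi(b)|\, |\alpha - \overline{(Z L')}_n(\tilde{\kappa}_n)| + \sup_{b \in K} |\partial_v b^T|\, |\mathbb{E}_{\mathbb{Q}_1}(Z X) - \overline{(Z L' X)}_n(\tilde{\kappa}_n)| \to 0. \tag{7.63}$$

□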
□ 
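Theorem 156. In the ECM setting, let us assume that $A = \mathbb{R}^l$, conditions 134, 135, and 147 hold, and for some $s > 2$ we have $Z \in L^{rs}(\mathbb{Q}_1)$. Then, a.s.

$$\lim_{k \to \infty} \overline{(Z L(b_{k-1}))}_{n_k}(\tilde{\chi}_k) = \alpha, \tag{7.64}$$

$$\lim_{k \to \infty} \overline{(Z X L(b_{k-1}))}_{n_k}(\tilde{\chi}_k) = \mathbb{E}_{\mathbb{Q}_1}(Z X), \tag{7.65}$$

and assuming further Condition 36, a.s.

$$b \to \partial_v \widehat{\mathrm{ce}}_{n_k}(b_{k-1}, b)(\tilde{\chi}_k) \overset{loc}{\rightrightarrows} \partial_v \mathrm{ce}, \quad v \in \mathbb{N}^l. \tag{7.66}$$

Proof. From Hölder's inequality, for each $2 < q < s$ we have $\mathbb{E}_{\mathbb{Q}_1}(|Z X_i|^{rq}) < \infty$, $i = 1, \ldots, l$. Thus, (7.64) and (7.65) follow from the counterpart of the first point of Theorem 154 for ECM for $S = Z$ and $S = Z X$ respectively and (7.66) can be proved similarly as (7.61) in Theorem 155. □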


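Theorem 157. In the LETGS setting, let us assume conditions 32 and 91. Then, a.s.

$$\overline{(Z G L')}_n(\tilde{\kappa}_n) \to \mathbb{E}_{\mathbb{Q}_1}(Z G), \tag{7.67}$$

$$\overline{(Z H L')}_n(\tilde{\kappa}_n) \to \mathbb{E}_{\mathbb{Q}_1}(Z H), \tag{7.68}$$

and

$$b \to \partial_v \widehat{\mathrm{ce}}_n(b', b)(\tilde{\kappa}_n) \overset{loc}{\rightrightarrows} \partial_v \mathrm{ce}, \quad v \in \mathbb{N}^l. \tag{7.69}$$

Proof. Formulas (7.67) and (7.68) follow from the SLLN. From (6.39) and (6.47), $\partial_v \mathrm{ce}(b)$ and $\partial_v \widehat{\mathrm{ce}}_n(b', b)(\tilde{\kappa}_n)$ can be nonzero only for $v \in \mathbb{N}^l$ such that $\sum_{i=1}^{l} v_i \le 2$. It is easy to check that for such a $v$, from (7.67) and (7.68), a.s.

$$\partial_v \mathrm{ce}(b) - \partial_v \widehat{\mathrm{ce}}_n(b', b)(\tilde{\kappa}_n) = \partial_v \big(b^T (\mathbb{E}_{\mathbb{Q}_1}(Z G) - \overline{(Z G L')}_n(\tilde{\kappa}_n)) b + (\mathbb{E}_{\mathbb{Q}_1}(Z H) - \overline{(Z H L')}_n(\tilde{\kappa}_n)) b\big) \overset{loc}{\rightrightarrows} 0. \tag{7.70}$$

□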


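Theorem 158. In the LETGS setting, let us assume conditions 68, 69, 134, 135, and 147, and that for some $p > 2$ we have $Z \in L^{rp}(\mathbb{Q}_1)$. Then, a.s.

$$\overline{(Z G L(b_{k-1}))}_{n_k}(\tilde{\chi}_k) \to \mathbb{E}_{\mathbb{Q}_1}(Z G), \tag{7.71}$$

$$\overline{(Z H L(b_{k-1}))}_{n_k}(\tilde{\chi}_k) \to \mathbb{E}_{\mathbb{Q}_1}(Z H), \tag{7.72}$$

and

$$b \to \partial_v \widehat{\mathrm{ce}}_{n_k}(b_{k-1}, b)(\tilde{\chi}_k) \overset{loc}{\rightrightarrows} \partial_v \mathrm{ce}. \tag{7.73}$$

Proof. From Lemma 92, for each $2 < u < p$ and $i, j \in \{1, \ldots, l\}$ we have $\mathbb{E}_{\mathbb{Q}_1}(|Z H_i|^{ru}) < \infty$ and $\mathbb{E}_{\mathbb{Q}_1}(|Z G_{i,j}|^{ru}) < \infty$. Thus, (7.71) and (7.72) follow from the first point of Theorem 154 for $S = Z G_{i,j}$ and $S = Z H_i$ respectively, and (7.73) can be proved similarly as (7.69) in Theorem 157. □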


7.4 Exact minimization of estimators 

In this section we define exact single- and multi-stage minimization methods of estimators, abbreviated as ESSM and EMSM. We also discuss the possibility of their application to the minimization of the cross-entropy estimators in the ECM and LETGS settings.

Let $T \subset \mathbb{R}_+$ be unbounded and for some $l \in \mathbb{N}_+$, let $B \in \mathcal{B}(\mathbb{R}^l)$ be nonempty. The ESSM and EMSM methods can be viewed as special cases of the following abstract method for exact minimization of random functions, which we call EM. In EM we assume the following condition.

Condition 159. For each $t \in T$ we are given a function $f_t : \mathcal{S}(B) \otimes (\Omega, \mathcal{F}) \to \mathcal{S}(\overline{\mathbb{R}})$, a set $G_t \in \mathcal{F}$, and a $B$-valued random variable $d_t$. Random variable $f_t(b, \cdot)$ is denoted shortly as $f_t(b)$. Furthermore, it is assumed that for each $t \in T$ and $\omega \in G_t$, $d_t(\omega)$ is the unique minimum point of $b \to f_t(b, \omega)$.

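Let us now define ESSM and EMSM. For some nonempty set $A \in \mathcal{B}(\mathbb{R}^l)$, $A \subset B$, and $p \in \mathbb{N}_+$, let
us consider functions

    $\widehat{\mathrm{est}}_n(b', b)(\omega), \quad b' \in A,\ b \in B,\ \omega \in \Omega_1^n, \quad n \in \mathbb{N}_p$. (7.74)

For $B = A$ these can be some estimators as in (4.13). We shall further often need the following
condition.

Condition 160. For each $n \in \mathbb{N}_p$, a set $D_n \in \mathcal{B}(A) \otimes \mathcal{F}_1^n$ is such that for each $(b', \omega) \in D_n$, the
function $b \in B \mapsto \widehat{\mathrm{est}}_n(b', b)(\omega)$ has a unique minimum point, denoted as $b_n^*(b', \omega)$.

In ESSM and EMSM we assume the following condition.

Condition 161. Condition 160 holds and for each $n \in \mathbb{N}_p$, for $\mathcal{D}_n := \{D_n \cap D : D \in \mathcal{B}(A) \otimes \mathcal{F}_1^n\}$,
the function $(b', \omega) \mapsto b_n^*(b', \omega)$ is measurable from $\mathcal{S}_n = (D_n, \mathcal{D}_n)$ to $\mathcal{S}(B)$.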

In ESSM we also assume Condition 32 and the following condition. 

Condition 162. $\mathbb{N} \cup \{\infty\}$-valued random variables $N_t$, $t \in T$, are such that a.s. (2.7) and (2.10)
hold.

Remark 163. In Condition 162 one can take e.g. $T = \mathbb{N}_+$ and $N_k = k$, $k \in \mathbb{N}_+$. Alternatively,
one can take $T = \mathbb{R}_+$ and for some nonnegative random variable $U$ on $\mathcal{S}_1$, $N_t$ can be given by
formula (2.5) or (2.6) but for $C_i = U(\kappa_i)$, $i \in \mathbb{N}_+$ (i.e. for $S_n = \sum_{i=1}^n U(\kappa_i)$, $n \in \mathbb{N}_+$). In such cases
sufficient conditions for (2.7) and (2.10) to hold a.s. were discussed in Chapter 2. For instance,
such a $U$ can be some theoretical cost variable, fulfilling $U = p_{\dot{U}} \dot{U}$ for some $p_{\dot{U}} \in \mathbb{R}_+$ and a
practical cost variable $\dot{U}$ for generating some replicates (e.g. of $Z$) under $\mathbb{Q}'$ and doing some
helper computations needed for the later estimator minimization. Such $U$ and $\dot{U}$ are defined
analogously as such costs $C$ and $\dot{C}$ of an MC step in Chapter 2 and shall be called the cost
variables of a step of SSM. In such a case, some $N_t$ as above can be interpreted as the number of
steps of SSM corresponding to an approximate theoretical budget $t$. Often one can take $U = C$,
as is the case in our numerical experiments.

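For each $t \in T$, in ESSM we define $d_t$ to be a $B$-valued random variable such that on the event

    $G_t := \{(N_t = k \in \mathbb{N}_p) \wedge ((b', \widetilde{\kappa}_k) \in D_k)\}$, (7.75)

we have

    $d_t = b_k^*(b', \widetilde{\kappa}_k)$. (7.76)

On $G_t' = \Omega \setminus G_t$ one can set e.g. $d_t = b'$, $t \in T$.

In EMSM we assume that conditions 134 and 135 hold for $n_k \in \mathbb{N}_p$, $k \in \mathbb{N}_+$. Furthermore, for
each $k \in \mathbb{N}_+$, $d_k$ is a $B$-valued random variable such that on the event

    $G_k := \{(b_{k-1}, \widetilde{\chi}_k) \in D_{n_k}\}$ (7.77)

we have

    $d_k = b_{n_k}^*(b_{k-1}, \widetilde{\chi}_k)$. (7.78)

On $G_k'$ one can set e.g. $d_k = b_0$ or $d_k = b_{k-1}$.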

Remark 164. ESSM and EMSM are special cases of EM for the respective $G_t$ and $d_t$ as above, in
ESSM for $f_t(b, \omega) = \mathbb{1}(N_t = k \in \mathbb{N}_p)\,\widehat{\mathrm{est}}_k(b', b)(\widetilde{\kappa}_k(\omega))$, while in EMSM for $T = \mathbb{N}_+$ and
$f_k(b, \omega) = \widehat{\mathrm{est}}_{n_k}(b_{k-1}, b)(\widetilde{\chi}_k(\omega))$.

In EMSM the variables $b_k$, $k \in \mathbb{N}$, satisfying Condition 134 can be defined in various ways.
An important possibility is when we are given some $K_0$-valued random variable $b_0$, and $b_k$,
$k \in \mathbb{N}_+$, are as in the below condition.

Condition 165. For each $k \in \mathbb{N}_+$, if $d_k \in K_k$, then $b_k = d_k$, and otherwise $b_k = r_k$ for some
$K_k$-valued random variable $r_k$.

Note that if $K_k \subset K_{k+1}$, $k \in \mathbb{N}$, then for each $k \in \mathbb{N}_+$, in the above condition we can take e.g.
$r_k = b_0$ or $r_k = b_{k-1}$.

Consider some function $f : A \to \mathbb{R}$ and let $b^* \in A$ be its unique minimum point. We will be
interested in verifying when some of the below conditions hold for EM methods, like ESSM
and EMSM under the identifications as in Remark 164, or for some other methods defined
further on.

Condition 166. Almost surely for a sufficiently large $t \in T$, $G_t$ holds.

Condition 167. It holds a.s. $\lim_{t \to \infty} d_t = b^*$.

Condition 168. It holds a.s. $\lim_{t \to \infty} f_t(d_t) = f(b^*)$.

Consider the following condition.

Condition 169. $A$ is open and $K_i \in \mathcal{B}(A)$, $i \in \mathbb{N}$, are such that for each compact set $D \subset A$, for a
sufficiently large $i$, $D \subset K_i$.

Note that if conditions 146 and 148 hold, then Condition 169 holds.

Remark 170. For EMSM let us assume conditions 165 and 169 (for the same sets $K_i$). Then, if for
some compact set $D \subset A$ a.s. $d_k \in D$ for a sufficiently large $k$ (which happens e.g. if a.s. $d_k \to b^*$
and $D$ is some compact neighbourhood of $b^*$), then a.s. for a sufficiently large $k$, $d_k = b_k$. In
particular, if additionally Condition 167 or 168 holds for EMSM then such a condition holds
also for $d_k$ replaced by $b_k$.

Let us now describe how ESSM and EMSM can be used for $\widehat{\mathrm{est}}_n = \widehat{\mathrm{ce}}_n$ in the ECM and LETGS
settings. Let us first consider ECM as in sections 5.1 and 6.1, assuming conditions 35 and
36, as well as that we have (6.8), $\alpha > 0$, and (6.10). Then, from the discussion in Section 6.1,
Condition 160 holds for

    $D_n = \Big\{(b', \omega) \in A \times \Omega_1^n : \overline{(ZL')}_n(\omega) > 0 \wedge \frac{\overline{(ZL'X)}_n}{\overline{(ZL')}_n}(\omega) \in \mu[A]\Big\}$, (7.79)

and from formula (6.7), Condition 161 holds. In ESSM, from (7.59) and (7.60) in Theorem 155
as well as from $\alpha > 0$, a.s. for a sufficiently large $n$ we have $\overline{(ZL')}_n(\widetilde{\kappa}_n) > 0$ and a.s.

    $\frac{\overline{(ZL'X)}_n}{\overline{(ZL')}_n}(\widetilde{\kappa}_n) \to \frac{\mathbb{E}_{\mathbb{Q}_1}(ZX)}{\alpha}$. (7.80)

Thus, using further (6.10), the fact that $\mu[A]$ is open, and Condition 162, a.s. for a sufficiently
large $t$, $G_t$ as in (7.75) holds (i.e. Condition 166 holds for ESSM), in which case

    $d_t = \mu^{-1}\Big(\frac{\overline{(ZL'X)}_k}{\overline{(ZL')}_k}(\widetilde{\kappa}_k)\Big)$. (7.81)

From Condition 162, (6.11), (7.80), (7.81), and the continuity of $\mu^{-1}$, Condition 167 holds. For
EMSM let us additionally make the assumptions as in Theorem 156. Then, from (7.64) and
(7.65) in that theorem, by similar arguments as above for ESSM, conditions 166 and 167 hold
for EMSM.
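The closed-form minimizer (7.81) can be illustrated by a minimal sketch under assumptions that are not taken from the text: a one-dimensional Gaussian family $\mathbb{Q}(b) = N(b, 1)$ with $\mathbb{Q}_1 = N(0, 1)$ and $X(x) = x$, so that $\mu$ is the identity, and the hypothetical choice $Z = e^x$, for which $\mathbb{E}_{\mathbb{Q}_1}(ZX)/\mathbb{E}_{\mathbb{Q}_1}(Z) = 1$. The function name and all parameter values are illustrative only.

```python
import math
import random

def essm_cross_entropy_gaussian(n, b_prime, z, rng):
    """Single-stage exact minimizer of the cross-entropy estimator in the
    assumed Gaussian setting: Q(b) = N(b, 1), Q_1 = N(0, 1), X(x) = x."""
    num = 0.0  # accumulates Z * L' * X over the sample
    den = 0.0  # accumulates Z * L' over the sample
    for _ in range(n):
        x = rng.gauss(b_prime, 1.0)                       # replicate under Q' = Q(b')
        lr = math.exp(-b_prime * x + 0.5 * b_prime ** 2)  # L' = dQ_1/dQ'
        num += z(x) * lr * x
        den += z(x) * lr
    if den <= 0.0:
        return b_prime  # outside the event G_t one can set d_t = b'
    return num / den    # mu^{-1} is the identity in this family

rng = random.Random(0)
d = essm_cross_entropy_gaussian(50_000, 0.5, math.exp, rng)
print(d)  # close to the cross-entropy minimizer b* = 1 for this choice of Z
```

In this family the ratio in (7.81) is itself the minimizer, so no iterative optimization is needed; for other exponential families one would additionally invert $\mu$.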

Consider now the LETGS setting and, using the notations as in Section 6.5, let us assume that
Condition 91 holds and $A$ is positive definite. From the discussion in that section, Condition
160 holds for $D_n = \{(b', \omega) \in A \times \Omega_1^n : A_n(b')(\omega) \text{ is positive definite}\}$, which, for $Z > 0$, fulfills
$D_n = A \times \{\omega \in \Omega_1^n : r_n(\omega)\}$. From formula (6.41), Condition 161 holds. In ESSM, from the SLLN
a.s. $A_n(b')(\widetilde{\kappa}_n) \to A$ and $B_n(b')(\widetilde{\kappa}_n) \to B$. Thus, from Lemma 81 and Condition 162, a.s. for a
sufficiently large $t$, $N_t = n \in \mathbb{N}_p$ and $A_n(b')(\widetilde{\kappa}_n)$ is positive definite (i.e. Condition 166 holds),
in which case $d_t = (A_n(b')(\widetilde{\kappa}_n))^{-1} B_n(b')(\widetilde{\kappa}_n)$. Thus, from (6.48), Condition 167 holds. For
EMSM, let us make the additional assumptions as in Theorem 158, so that from (7.71), a.s.
$A_{n_k}(b_{k-1})(\widetilde{\chi}_k) \to A$, and from (7.72), a.s. $B_{n_k}(b_{k-1})(\widetilde{\chi}_k) \to B$. Then, analogously as for ESSM
above, conditions 166 and 167 hold for EMSM.


7.5 Helper theorems for proving the convergence properties of minimization methods with gradient-based stopping criteria

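Condition 171. For a random variable $Y$ with values in a measurable space $\mathcal{S}$ and a nonempty
set $A \in \mathcal{B}(\mathbb{R}^l)$, a function $r : \mathcal{S}(A) \otimes \mathcal{S} \to \mathcal{S}(\overline{\mathbb{R}})$ is such that Condition 94 holds for
$h(b, \cdot) = r(b, Y(\cdot))$, $b \in A$.

For $a \in \mathbb{R}^l$ and $\epsilon \in \mathbb{R}_+$, we define a sphere $S_l(a, \epsilon) = \{x \in \mathbb{R}^l : |x - a| = \epsilon\}$, a ball
$B_l(a, \epsilon) = \{x \in \mathbb{R}^l : |x - a| < \epsilon\}$, and a closed ball $\overline{B}_l(a, \epsilon) = \overline{B_l(a, \epsilon)} = \{x \in \mathbb{R}^l : |x - a| \le \epsilon\}$.
The proof of the below lemma uses a similar reasoning as in the proof of consistency of
M-estimators in Theorem 5.14 in [55].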

Lemma 172. Let Condition 171 hold, $Y_1, Y_2, \ldots$ be i.i.d. $\sim Y$, $b \in A \mapsto f_n(b) := \frac{1}{n}\sum_{i=1}^n r(b, Y_i)$,
$n \in \mathbb{N}_+$, $K \subset A$ be a nonempty compact set, and $m$ be the minimum of $f$ on $K$ (which exists
due to lemmas 93 and 95). Then, for each $a \in (-\infty, m)$, a.s. for a sufficiently large $n$, $f_n(b) > a$,
$b \in K$.

Proof. Let $U_j = B_l(0, j^{-1})$, $j \in \mathbb{N}_+$. From the a.s. lower semicontinuity of $b \mapsto r(b, Y)$, for
each $v \in K$, for $g_{j,v}(x) = \inf_{b \in \{v + U_j\} \cap K} r(b, x)$, we have a.s. $g_{j,v}(Y) \uparrow r(v, Y)$ as $j \to \infty$. Thus,
from the monotone convergence theorem, $\mathbb{E}(g_{j,v}(Y)) \uparrow f(v)$ as $j \to \infty$, $v \in K$. In particular,
$\mathbb{E}(g_{j,v}(Y)) > a$ for $j \ge j_v$ for some $j_v \in \mathbb{N}_+$, $v \in K$. The family $\{D_v := v + U_{j_v} : v \in K\}$ is a cover
of $K$. From the compactness of $K$, let $\{D_{v_1}, \ldots, D_{v_q}\}$ be its finite subcover. Then, from the
generalized SLLN in Theorem 1 (which can be used thanks to (6.49)), a.s.

    $\inf_{b \in K} f_n(b) \ge \min_{k \in \{1,\ldots,q\}} \frac{1}{n}\sum_{i=1}^n g_{j_{v_k}, v_k}(Y_i) \to \min_{k \in \{1,\ldots,q\}} \mathbb{E}(g_{j_{v_k}, v_k}(Y)) > a$. (7.82)

□


Lemma 173. Let Condition 171 hold for $r$ equal to some nonnegative $r_1$ and $r_2$, for the same
$Y$ and $A$. Let $g(b) = \mathbb{E}(r_1(b, Y))$ and $\mathbb{E}(r_2(b, Y)) = 1$, $b \in A$, let $Y_1, Y_2, \ldots$ be i.i.d. $\sim Y$, and
let $b \in A \mapsto f_{i,n}(b) := \frac{1}{n}\sum_{j=1}^n r_i(b, Y_j)$, $i = 1, 2$, and $b \in A \mapsto g_n(b) := f_{1,n}(b) f_{2,n}(b)$, $n \in \mathbb{N}_+$. Let
$K \subset A$ be a nonempty compact set and $m$ be the minimum of $g$ on $K$. Then, for each $a \in (-\infty, m)$,
a.s. for a sufficiently large $n$,

    $g_n(b) > a, \quad b \in K$. (7.83)

Proof. It holds $g_n(b) \ge 0$, $n \in \mathbb{N}_+$, $b \in A$, so that it is sufficient to consider the case when $m > 0$
and $0 < a < m$. Let $a < d < m$. Then, from Lemma 172, a.s. for a sufficiently large $n$, $f_{1,n}(b) > d$
and $f_{2,n}(b) > \frac{a}{d}$, $b \in K$, in which case (7.83) holds. □

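Condition 174. We have $b^* \in \mathbb{R}^l$ and $A \in \mathcal{B}(\mathbb{R}^l)$ is a neighbourhood of $b^*$. A function $f : A \to \overline{\mathbb{R}}$,
$f > -\infty$, is lower semicontinuous and $b^*$ is its unique minimum point (in particular, $f(b^*) < \infty$).

Condition 175. Condition 174 holds and $B \subset \mathbb{R}^l$ is such that $A \subset B$. Functions $f_n : B \to \overline{\mathbb{R}}$,
$n \in \mathbb{N}_+$, fulfill

    $\lim_{n \to \infty} f_n(b^*) = f(b^*)$. (7.84)

Furthermore, for each compact set $K \subset A$, for $m$ equal to the minimum of $f$ on $K$, for each
$a < m$, for a sufficiently large $n$, $f_n(x) > a$, $x \in K$.

Remark 176. Let Condition 174 hold, $B \subset \mathbb{R}^l$, $A \subset B$, and $f_n : B \to \overline{\mathbb{R}}$, $n \in \mathbb{N}_+$, be such that
$f_{n|A} \overset{\mathrm{loc}}{\rightrightarrows} f$. Then, Condition 175 holds.

Remark 177. Let us assume Condition 175, let $\epsilon \in \mathbb{R}_+$ be such that $\overline{B}_l(b^*, \epsilon) \subset A$ and let $c$ be
the minimum of $f$ on $S_l(b^*, \epsilon)$. From the uniqueness of the minimum point $b^*$ of $f$, we have
$c > f(b^*)$. Let $\delta \in \mathbb{R}_+$ be such that $c > f(b^*) + \delta$. Then, for a sufficiently large $n$

    $f_n(b) > f(b^*) + \delta, \quad b \in S_l(b^*, \epsilon)$ (7.85)

and

    $f_n(b^*) < f(b^*) + \frac{\delta}{2}$. (7.86)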

Theorem 178. Let us assume that Condition 175 holds for a convex $B$ and for $f_n$, $n \in \mathbb{N}_+$, which
are convex and continuous. Then, for a sufficiently large $n$, $f_n$ possesses a minimum point
$a_n \in B$. Furthermore,

    $\lim_{n \to \infty} a_n = b^*$ (7.87)

and

    $\lim_{n \to \infty} f_n(a_n) = f(b^*)$. (7.88)

If further $B$ is open, $f_n$, $n \in \mathbb{N}_+$, are differentiable on $B$, and a sequence $b_n \in B$, $n \in \mathbb{N}_+$, is such
that $\lim_{n \to \infty} |\nabla f_n(b_n)| = 0$, then

    $\lim_{n \to \infty} b_n = b^*$ (7.89)

and

    $\lim_{n \to \infty} f_n(b_n) = f(b^*)$. (7.90)

Proof. Let us consider $\epsilon, \delta \in \mathbb{R}_+$ as in Remark 177. From this remark, let $N \in \mathbb{N}_+$ be such that
for $n > N$ we have (7.85) and (7.86). Then, for $n > N$, for each $b \in B$ such that $|b - b^*| \ge \epsilon$, from
the convexity of $f_n$

    $f_n(b) - f_n(b^*) \ge \frac{|b - b^*|}{\epsilon}\Big(f_n\Big(b^* + \epsilon\frac{b - b^*}{|b - b^*|}\Big) - f_n(b^*)\Big) \ge \frac{|b - b^*|\delta}{2\epsilon} > 0$. (7.91)

For $n > N$, from (7.91) and the continuity of $f_n$, $f_n$ has a minimum point $a_n$ fulfilling

    $|a_n - b^*| < \epsilon$. (7.92)

This proves (7.87). For $n > N$, from (7.86) and $f_n(a_n) \le f_n(b^*)$ we have

    $f_n(a_n) < f(b^*) + \frac{\delta}{2}$. (7.93)

From Condition 175, for some $N_1 > N$, for $n > N_1$,

    $f_n(b) > f(b^*) - \frac{\delta}{2}, \quad b \in \overline{B}_l(b^*, \epsilon)$. (7.94)

Thus, for $n > N_1$, from (7.92), (7.94), and (7.93), we receive that $|f_n(a_n) - f(b^*)| < \frac{\delta}{2}$. Since we
could have selected $\delta$ arbitrarily small, we receive (7.88).


Let $B$ be open and $f_n$ be differentiable. Then, for $b \in B$ such that $b \ne b^*$, for $v = \frac{b - b^*}{|b - b^*|}$, from
the convexity of $f_n$

    $|\nabla f_n(b)| \ge \nabla_v f_n(b) \ge \frac{f_n(b) - f_n(b^*)}{|b - b^*|}$. (7.95)

Thus, for each $b \in B$ for which $|b - b^*| \ge \epsilon$, for $n > N$, from (7.95) and (7.91)

    $|\nabla f_n(b)| \ge \frac{\delta}{2\epsilon}$. (7.96)

Let $N_2 > N$ be such that for $n > N_2$

    $|\nabla f_n(b_n)| < \frac{\delta}{2\epsilon}$. (7.97)

Then, from (7.96), for $n > N_2$

    $|b_n - b^*| < \epsilon$, (7.98)

which proves (7.89). For $n > N_1 \vee N_2$ we have

    $\frac{\delta}{2} = \frac{\delta}{2\epsilon}\epsilon > |\nabla f_n(b_n)||b_n - b^*| \ge f_n(b_n) - f_n(b^*) \ge f_n(b_n) - f(b^*) - \frac{\delta}{2} \ge -\delta$, (7.99)

where in the first inequality we used (7.97) and (7.98), in the second (7.95), in the third (7.86),
and in the last one (7.94). Thus, in such a case

    $\delta > f_n(b_n) - f(b^*) \ge -\frac{\delta}{2}$, (7.100)

which proves (7.90). □
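The second part of Theorem 178 can be illustrated numerically. The sketch below uses a hypothetical family $f_n(b) = (1 + 1/n)(b - 1)^2 + b/n$ on $B = \mathbb{R}$ (not taken from the text), which is convex, differentiable, and converges locally uniformly to $f(b) = (b - 1)^2$ with $b^* = 1$. Points $b_n$ are produced by plain gradient descent stopped as soon as $|\nabla f_n(b_n)| \le 1/n$; the step size and starting point are arbitrary choices.

```python
def f_n_grad(n, b):
    # Gradient of the assumed f_n(b) = (1 + 1/n) * (b - 1)**2 + b / n.
    return 2.0 * (1.0 + 1.0 / n) * (b - 1.0) + 1.0 / n

def approx_stationary_point(n, eps, b0=0.0, step=0.2, max_iter=100_000):
    """Gradient descent on f_n, stopped at the first point with |grad| <= eps."""
    b = b0
    for _ in range(max_iter):
        g = f_n_grad(n, b)
        if abs(g) <= eps:
            return b
        b -= step * g
    raise RuntimeError("stopping criterion not reached")

# Stopping tolerances eps_n = 1/n tend to zero, so b_n should approach b* = 1.
points = {n: approx_stationary_point(n, eps=1.0 / n) for n in (1, 10, 100, 1000)}
for n, b_n in points.items():
    print(n, b_n, abs(b_n - 1.0))
```

For this family the exact minimizer is $a_n = 1 - \frac{1}{2(n+1)}$, and the stopping rule gives $|b_n - a_n| \le \frac{1}{2(n+1)}$, so $|b_n - 1| \le \frac{1}{n+1}$.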

Lemma 179. Let $A \subset \mathbb{R}^l$ be open. If a twice continuously differentiable function $f : A \to \mathbb{R}$ has a
positive definite Hessian on $A$, then for each convex $U \subset A$ such that for some compact $K \subset A$,
$U \subset K$, $f$ is strongly convex on $U$. If further $\lim_{x \uparrow A} f(x) = \infty$, then for each $x_0 \in A$ as such a $U$
one can take the sublevel set $S = \{x \in A : f(x) \le f(x_0)\}$.

Proof. From Lemma 80, $b \in A \mapsto m_l(\nabla^2 f(b))$ is continuous and thus $f$ is strongly convex
on $U$ with a constant $\inf_{x \in K} m_l(\nabla^2 f(x)) > 0$. From the convexity of $f$, $S$ as above is convex.
Furthermore, if $\lim_{x \uparrow A} f(x) = \infty$, then for a sufficiently large $M$, for a compact set $K$ as in (6.51)
we have $S \subset K$. □

7.6 Minimization of estimators with gradient-based stopping criteria

In this section we define single- and multi-stage minimization methods of estimators with 
gradient-based stopping criteria, abbreviated as GSSM and GMSM respectively. We also 
discuss the possibility of their application to the minimization of the well-known mean square 
estimators in the LETGS setting and both the well-known and the new mean square estimators 
in the ECM setting. 

Consider some sets $T$ and $B$ as in Section 7.4 and let additionally such a $B$ be open. GSSM
and GMSM are special cases of the following minimization method of random functions
with gradient-based stopping criteria, abbreviated as GM. In GM we assume Condition 159.
Furthermore, we assume that $b \mapsto f_t(b, \omega)$ is differentiable, $t \in T$, $\omega \in \Omega$, and that we are given
$[0, \infty]$-valued random variables $\epsilon_t$, $t \in T$, such that a.s.

    $\lim_{t \to \infty} \epsilon_t = 0$ (7.101)

and

    $|\nabla_b f_t(d_t(\omega), \omega)| \le \epsilon_t(\omega), \quad \omega \in G_t,\ t \in T$. (7.102)

We shall further need the following conditions and lemmas.




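Condition 180. Condition 160 holds, for each $n \in \mathbb{N}_p$ and $(b', \omega) \in D_n$, $b \in B \mapsto \widehat{\mathrm{est}}_n(b', b)(\omega)$ is
differentiable, and $b_n^*(b', \omega)$ is equal to the unique point $c \in B$ such that $\nabla_b \widehat{\mathrm{est}}_n(b', c)(\omega) = 0$.

Lemma 181. Condition 180 implies Condition 161.

Proof. A function $(b, (b', \omega)) \in B \times D_n \mapsto g(b, (b', \omega)) := \nabla_b \widehat{\mathrm{est}}_n(b', b)(\omega)$ is measurable from
$\mathcal{S}(B) \otimes (D_n, \mathcal{D}_n)$ to $\mathcal{S}(\mathbb{R}^l)$ and for each $D \in \mathcal{B}(B)$, $(b_n^*)^{-1}(D) = \{(b', \omega) \in D_n : \text{there exists } c \in D
\text{ such that } g(c, (b', \omega)) = 0\}$ is a projection of $g^{-1}(0) \cap (D \times D_n) \in \mathcal{B}(B) \otimes \mathcal{D}_n$ onto the second
coordinate. Thus, $(b_n^*)^{-1}(D) \in \mathcal{D}_n$, $D \in \mathcal{B}(B)$. □

Condition 182. The set $B$ is convex. Furthermore, for each $n \in \mathbb{N}_p$, a set $\widetilde{D}_n \in \mathcal{F}_1^n$ is such that
for each $b' \in A$ and $\omega \in \widetilde{D}_n$, $b \in B \mapsto g(b) := \widehat{\mathrm{est}}_n(b', b)(\omega)$ is smooth with a positive definite
Hessian on $B$, and $\lim_{b \uparrow B} g(b) = \infty$.

Lemma 183. Condition 182 implies Condition 180 for $b_n^*(b', \omega)$ as in Condition 160 and
$D_n = A \times \widetilde{D}_n$, $n \in \mathbb{N}_p$.

Proof. It follows from lemmas 39 and 98. □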

Except for some differences mentioned below, we define GSSM and GMSM in the same way as
ESSM and EMSM in Section 7.4. The first difference is that in GSSM and GMSM we
additionally assume that Condition 180 holds for $B$ as above and we consider $[0, \infty]$-valued
random variables $\epsilon_t$, $t \in T$, such that a.s. (7.101) holds. Furthermore, in GSSM, for $t \in T$, on $G_t$
as in (7.75), instead of (7.76) we require that $|\nabla_b \widehat{\mathrm{est}}_k(b', d_t)(\widetilde{\kappa}_k)| \le \epsilon_t$, while in GMSM, for
$k \in \mathbb{N}_+$, on $G_k$ as in (7.77), instead of (7.78) we require that $|\nabla_b \widehat{\mathrm{est}}_{n_k}(b_{k-1}, d_k)(\widetilde{\chi}_k)| \le \epsilon_k$.
Note that GSSM and GMSM are special cases of GM under the identifications as in Remark
164. Such identifications shall be frequently considered below. From Lemma 181, for $\epsilon_t = 0$,
$t \in T$, GSSM and GMSM become special cases of ESSM and EMSM respectively.

Remark 184. Let us discuss how one can construct the variables $d_t$, $t \in T$, in GSSM and GMSM,
assuming that the other variables as above are given. Let $t \in T$. From Condition 180, on an
arbitrary event $A_t$ contained in the appropriate $G_t$ as above, like $A_t = G_t$ or $A_t = G_t \cap \{\epsilon_t = 0\}$,
we can take in GSSM $d_t = b_k^*(b', \widetilde{\kappa}_k)$ and in GMSM $d_t = b_{n_t}^*(b_{t-1}, \widetilde{\chi}_t)$. Note that from Lemma
181, in both these cases $d_t$ is measurable on $A_t$. Unfortunately, in the examples discussed below
such $d_t(\omega)$, $\omega \in A_t$, typically cannot be found in practice. Let now $\omega \in \Omega$ be such that $\epsilon_t(\omega) > 0$.
Then, under some additional assumptions on $b \mapsto g(b) := f_t(b, \omega)$, $d_t(\omega)$ in GSSM or GMSM
can be a result of some globally convergent iterative minimization method (i.e. one in which
the gradients in the subsequent points converge to zero), minimizing $g$, started at $x_0$ equal to $b'$
in GSSM or $b_{t-1}(\omega)$ in GMSM, and stopped in the first point $d_t(\omega)$ in which (7.102) holds. As
such an iterative method one can potentially use the damped Newton method, for the global
and quadratic convergence of which it is sufficient if $g$ is strongly convex on the sublevel set
$S := \{x \in B : g(x) \le g(x_0)\}$, $g$ is twice continuously differentiable on some open neighbourhood of
such an $S$, and the second derivative of $g$ is Lipschitz on $S$ (see Section 9.5.3 in [9]). From Lemma
179, such assumptions hold in the above discussed GSSM and GMSM methods if Condition 182
holds, we consider the corresponding $D_n$, $n \in \mathbb{N}_p$, as in Lemma 183, and we have $\omega \in G_t$. See
[9] and [41] for some other examples of globally convergent minimization methods requiring
typically weaker assumptions. In Remark 188 below we discuss a situation when one can
perform some minimization method of a $g$ as above for each $\omega \in \Omega$ such that $\epsilon_t(\omega) > 0$. For most
iterative minimization methods, including the damped Newton method, if the same method
is used for each $\omega$ in some event $B_t$ contained in $\{\epsilon_t > 0\}$, then the fact that the resulting $d_t$ is
measurable on $B_t$ follows from the definition of the method. On $G_t'$ one can define $d_t$ in similar
ways as for ESSM or EMSM in Section 7.4.
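The construction in Remark 184 can be sketched as follows: a damped Newton method with backtracking line search, started at $x_0$ and stopped at the first point in which the gradient norm is at most $\epsilon$, as in (7.102). The test function, the backtracking constants, and the starting point are assumptions made for illustration (the function is a standard strongly convex example on $\mathbb{R}^2$, not one of the estimators considered here).

```python
import math

def _terms(x):
    x1, x2 = x
    return (math.exp(x1 + 3.0 * x2 - 0.1),
            math.exp(x1 - 3.0 * x2 - 0.1),
            math.exp(-x1 - 0.1))

def g(x):
    a, b, c = _terms(x)
    return a + b + c

def grad(x):
    a, b, c = _terms(x)
    return (a + b - c, 3.0 * a - 3.0 * b)

def hess(x):
    a, b, c = _terms(x)
    return (a + b + c, 3.0 * a - 3.0 * b, 9.0 * a + 9.0 * b)  # (h11, h12, h22)

def damped_newton(x0, eps, alpha=0.25, beta=0.5, max_iter=100):
    """Return the first iterate whose gradient norm is at most eps."""
    x = x0
    for _ in range(max_iter):
        gx = grad(x)
        if math.hypot(gx[0], gx[1]) <= eps:
            return x
        h11, h12, h22 = hess(x)
        det = h11 * h22 - h12 * h12
        # Newton direction s solves H s = -grad (2x2 system solved explicitly).
        s = (-(h22 * gx[0] - h12 * gx[1]) / det,
             -(-h12 * gx[0] + h11 * gx[1]) / det)
        slope = gx[0] * s[0] + gx[1] * s[1]
        t = 1.0
        # Backtracking: shrink the step until sufficient decrease holds.
        while g((x[0] + t * s[0], x[1] + t * s[1])) > g(x) + alpha * t * slope:
            t *= beta
        x = (x[0] + t * s[0], x[1] + t * s[1])
    raise RuntimeError("stopping criterion not reached")

d = damped_newton((-1.0, 1.0), eps=1e-8)
print(d)  # the exact minimizer of this g is (-ln(2)/2, 0)
```

In GSSM one would take $x_0 = b'$ and in GMSM $x_0 = b_{t-1}(\omega)$, with $g$ replaced by the estimator being minimized.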

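Condition 185. The above set $B$ is an open convex neighbourhood of some $b^* \in \mathbb{R}^l$ and $\epsilon \in \mathbb{R}_+$
is such that $\overline{B}_l(b^*, \epsilon) \subset B$. Furthermore, for some $\widehat{\mathrm{est}}_n$ as in (7.74) for some $n \in \mathbb{N}_p$, $b' \in A$ and
$\omega \in \Omega_1^n$ are such that

    $\inf_{b \in S_l(b^*, \epsilon)} \widehat{\mathrm{est}}_n(b', b)(\omega) > \widehat{\mathrm{est}}_n(b', b^*)(\omega)$. (7.103)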

The following remark will be useful for proving the convergence properties of the GM methods 
in the below examples. 


Remark 186. Consider the LETGS setting. Then, from Theorem 112, if Condition 185 holds for
$\widehat{\mathrm{est}}_n = \widehat{\mathrm{msq}}_n$, then for $a = b^*$ and each $b \in \mathbb{R}^l \setminus \{0\}$ we cannot have (6.76) for



and thus $r_n(\omega)$ holds. Let us now consider the ECM setting. Then, if Condition 185 holds for
$\widehat{\mathrm{est}}_n = g_{\mathrm{var},n}$ (see (6.60)) then for $a = b^*$ and each $b \in \mathbb{R}^l \setminus \{0\}$ we cannot have (6.66) for $t$ as in
(7.104). Thus, from Theorem 110, in such a case system (6.65) has only the zero solution.


For the GSSM and GMSM methods in the below examples we shall discuss when Condition
182 holds in them and we consider Condition 180 holding in them as a result of Lemma 183.
For GMSM in all the below examples we assume that conditions 146 and 147 hold (where in
the ECM setting we mean the counterparts of these conditions).

Let us first discuss GSSM and GMSM for $\widehat{\mathrm{est}}_n = \widehat{\mathrm{msq}}_n$, $n \in \mathbb{N}_+$, in the LETGS setting. From
Theorem 112, we can and shall take in Condition 182, $\widetilde{D}_n = \{\omega \in \Omega_1^n : r_n(\omega)\}$, $n \in \mathbb{N}_+$. Let us
assume conditions 52 and 76 and that for some $b \in A$, $\mathrm{msq}(b) < \infty$, so that from Theorem 114,
we can and shall take in Condition 174, $f = \mathrm{msq}$. In GSSM, from Condition 162 and Theorem
82, Condition 166 holds. From the SLLN and Lemma 172 for

    $r(b, x) = (Z^2 L' L(b))(x), \quad b \in A,\ x \in \Omega_1$, (7.105)

and $Y_i = \kappa_i$, $i \in \mathbb{N}_+$, for $\mathbb{P}$ a.e. $\omega \in \Omega$, Condition 175 holds for $B = A$ and $f_n(b) = \widehat{\mathrm{msq}}_n(b', b)(\widetilde{\kappa}_n(\omega))$.
Thus, from Theorem 178, (7.102), and (7.101), conditions 167 and 168 hold. For GMSM let us
assume that Condition 151 holds for $S = Z^2$. Then, from the third point of Theorem 154 and
remarks 176, 177, and 186, a.s. for a sufficiently large $k$, $r_{n_k}(\widetilde{\chi}_k)$ holds, i.e. Condition 166 holds.
Thus, from (7.101), (7.102), and Theorem 178, conditions 167 and 168 hold too.

Let us now consider the ECM setting, assuming the following condition.

Condition 187. Conditions 35 and 36 hold and $f = \mathrm{msq}$ satisfies Condition 174.

From Lemma 98, $f = \mathrm{msq}$ satisfies Condition 174 for instance when $f = \mathrm{msq}$ satisfies Condition
97, which due to Lemma 106 holds e.g. if for some $b \in A$, $\mathrm{msq}(b) < \infty$, and

    $\mathbb{Q}_1(p_{\mathrm{msq}}) > 0$. (7.106)

Let us first consider the case of $\widehat{\mathrm{est}}_n = \widehat{\mathrm{msq}}_n$, $n \in \mathbb{N}_+$, for which let us assume (7.106). From
remarks 103 and 107, we can and shall take in Condition 182,
$\widetilde{D}_n = \{\omega \in \Omega_1^n : p_{\mathrm{msq}}(\omega_i) \text{ holds for some } i \in \{1, \ldots, n\}\}$, $n \in \mathbb{N}_+$. In GSSM, from the SLLN, a.s.
$\lim_{n \to \infty} \overline{(\mathbb{1}(p_{\mathrm{msq}})L')}_n(\widetilde{\kappa}_n) = \mathbb{Q}_1(p_{\mathrm{msq}})$, so that from
(7.106) and the SLLN we have a.s. $\widetilde{\kappa}_k \in \widetilde{D}_k$ for a sufficiently large $k$. Thus, from Condition 162,
Condition 166 holds. Using further Lemma 172 for $r$ as in (7.105) and Theorem 178, conditions
167 and 168 hold too. In GMSM, let $A = \mathbb{R}^l$. Since the counterpart of Condition 151 for ECM is
fulfilled for $S = \mathbb{1}(p_{\mathrm{msq}})$, from the first point of Theorem 154, a.s. $\overline{(L(b_{k-1})S)}_{n_k}(\widetilde{\chi}_k) \to \mathbb{Q}_1(p_{\mathrm{msq}})$.
Thus, from (7.106), Condition 166 holds. Let us assume Condition 151 for $S = Z^2$. Then, from
the third point of Theorem 154 and Theorem 178, conditions 167 and 168 hold.

Let us now consider for each $n \in \mathbb{N}_2$

    $\widehat{\mathrm{est}}_n(b', b) = \widetilde{\mathrm{msq2}}_n(b', b) := g_{\mathrm{var},n}(b', b) + \frac{1}{n}\overline{((ZL')^2)}_n, \quad b' \in A,\ b \in B = \mathbb{R}^l$ (7.107)

(see (6.60)). Then, from (6.61), (6.54), and (6.53)

    $\widehat{\mathrm{msq2}}_n(b', b) = \widetilde{\mathrm{msq2}}_n(b', b), \quad b', b \in A$. (7.108)

Note that $\frac{1}{n}\overline{((ZL')^2)}_n$ does not depend on $b$. Thus, from Theorem 110, we can and shall take
in Condition 182, $\widetilde{D}_n = \{\omega \in \Omega_1^n : \text{system (6.65) has only the zero solution}\}$. In GSSM, from
the SLLN and Lemma 173 for $r_1(b, y) = (Z^2 L' L(b))(y)$, $r_2(b, y) = \frac{L'}{L(b)}(y)$, $b \in A$, $y \in \Omega_1$, and
$Y_i = \kappa_i$, $i \in \mathbb{N}_+$, for $\mathbb{P}$ a.e. $\omega \in \Omega$, Condition 175 holds for $B = A$ and $f_n(b) = \widehat{\mathrm{msq2}}_n(b', b)(\widetilde{\kappa}_n(\omega))$,
$b \in A$, and thus from (7.108) it holds also for $B = \mathbb{R}^l$ and $f_n(b) = \widetilde{\mathrm{msq2}}_n(b', b)(\widetilde{\kappa}_n(\omega))$, $b \in B$.
Therefore, from remarks 177 and 186 and Condition 162, Condition 166 holds. Using further
Theorem 178, conditions 167 and 168 hold as well. In GMSM, let $A = \mathbb{R}^l$ and let us assume
that the counterpart of Condition 151 for ECM holds for $S = Z^2$ (note that for $S = 1$ it holds
automatically). Then, from the fifth point of Theorem 154 and from remarks 177 and 186,
Condition 166 holds. Using further Theorem 178, conditions 167 and 168 hold as well.
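The two mean square estimators discussed above can be compared in a small sketch. It rests on assumptions that are not stated in this section: the one-dimensional Gaussian family $\mathbb{Q}(b) = N(b, 1)$ with $\mathbb{Q}_1 = N(0, 1)$, the hypothetical choice $Z = e^x$ (for which $b = 1$ is a zero-variance parameter), and the reading of the new estimator as the product of the sample means of $Z^2 L' L(b)$ and $L'/L(b)$, as suggested by $r_1$ and $r_2$ above. A ternary search stands in for the gradient-based methods of this section, which is legitimate here because both estimators are convex in $b$ in this family.

```python
import math
import random

def make_estimators(xs, b_prime, z):
    """Return b -> msq_n(b', b) and b -> msq2_n(b', b) for a sample xs drawn under Q(b')."""
    n = len(xs)
    lr_prime = [math.exp(-b_prime * x + 0.5 * b_prime ** 2) for x in xs]  # L' = dQ_1/dQ(b')
    w = [z(x) ** 2 * l for x, l in zip(xs, lr_prime)]                     # Z^2 L'

    def lr(b, x):
        return math.exp(-b * x + 0.5 * b * b)                             # L(b) = dQ_1/dQ(b)

    def msq(b):
        return sum(wi * lr(b, x) for wi, x in zip(w, xs)) / n

    def msq2(b):
        second = sum(l / lr(b, x) for l, x in zip(lr_prime, xs)) / n
        return msq(b) * second

    return msq, msq2

def ternary_min(fun, lo, hi, iters=80):
    """Minimize a convex function of one variable on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if fun(m1) < fun(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)

rng = random.Random(1)
b_prime = 0.5
xs = [rng.gauss(b_prime, 1.0) for _ in range(2000)]
msq, msq2 = make_estimators(xs, b_prime, math.exp)
b_msq = ternary_min(msq, -4.0, 6.0)
b_msq2 = ternary_min(msq2, -4.0, 6.0)
print(b_msq, b_msq2)
```

In this particular example the stationarity equation of the product estimator at $b = 1$ is a sum of terms antisymmetric in the sample indices, so its minimizer equals the zero-variance parameter up to numerical error for any sample, while the minimizer of the first estimator fluctuates around it at the usual $n^{-1/2}$ scale.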

Remark 188. Checking if $G_t$ holds in possible practical realizations of GSSM or GMSM methods,
as it can be done when using the damped Newton method as discussed in Remark 184, may
be inconvenient. For instance, for $\widehat{\mathrm{est}}_n = \widehat{\mathrm{msq}}_n$ in the LETGS setting or $\widehat{\mathrm{est}}_n$ as in (7.107) in the
ECM setting as above, this typically cannot be done precisely due to numerical errors, and one
has to make a rather arbitrary decision when such a condition holds approximately. From
the below discussion, in the latter case one can avoid checking if $G_t$ holds and perform some
minimization method of a $g$ as in Remark 184 for each $\omega \in \Omega$ such that $\epsilon_t(\omega) > 0$. From the
Zoutendijk theorem (see Theorem 3.2 in [41]), for a number of line search minimization methods
of a function $g : B \to \mathbb{R}$ started at $x_0 \in B$ to be globally convergent it is sufficient if $g$ is bounded
from below and continuously differentiable on some open neighbourhood $\mathcal{N}$ of the sublevel
set $\{x \in B : g(x) \le g(x_0)\}$, and if $\nabla g$ is Lipschitz on $\mathcal{N}$. In particular, it is sufficient if in addition
to the boundedness from below, $g$ is twice differentiable and $\|\nabla^2 g\|_\infty$ is bounded on such an
$\mathcal{N}$. One of the methods for which this holds is gradient descent with step lengths satisfying the
Wolfe conditions; see [41]. Note that from (6.60), (6.63), and $\|vv^T\|_\infty = |v|^2$, $v \in \mathbb{R}^l$, for $\omega \in \Omega_1^n$
and $K = \max_{i,j \in \{1,\ldots,n\}} |v_{i,j}(\omega)|^2$, we have for each $b' \in A$ and $b \in \mathbb{R}^l$


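    $\|\nabla_b^2 g_{\mathrm{var},n}(b', b)(\omega)\|_\infty \le \frac{1}{n^2}\sum_{i=1}^n (Z^2 L')(\omega_i) \sum_{j \in \{1,\ldots,n\},\, j \ne i} L'(\omega_j)\|v_{i,j}(\omega) v_{i,j}(\omega)^T\|_\infty \exp(b^T v_{i,j}(\omega)) \le K g_{\mathrm{var},n}(b', b)(\omega)$. (7.109)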


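Thus, for $\widehat{\mathrm{est}}_n$ as in (7.107) it also holds $\|\nabla_b^2 \widehat{\mathrm{est}}_n(b', b)(\omega)\|_\infty \le K \widehat{\mathrm{est}}_n(b', b)(\omega)$. From this it
follows that for $g$ as in Remark 184 corresponding to the GSSM or GMSM methods for $\widehat{\mathrm{est}}_n$ as
above, for each $x_0 \in \mathbb{R}^l$ and $\delta \in \mathbb{R}_+$, the assumptions of the Zoutendijk theorem as above hold for
$\mathcal{N} = \{x \in \mathbb{R}^l : g(x) < g(x_0) + \delta\}$.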
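Gradient descent with Wolfe step lengths, mentioned in Remark 188, can be sketched as below. The function $g(b) = e^{b} + e^{-2b}$ is a hypothetical one-dimensional stand-in chosen because, like the estimators in (7.109), it is a positive combination of exponentials and satisfies $g'' \le 4g$; the bisection line search and the constants $c_1$, $c_2$ are common textbook choices, not prescriptions of the text.

```python
import math

def g(b):
    return math.exp(b) + math.exp(-2.0 * b)

def dg(b):
    return math.exp(b) - 2.0 * math.exp(-2.0 * b)

def wolfe_step(x, d, c1=1e-4, c2=0.9, max_iter=60):
    """Bisection search for a step length satisfying the (weak) Wolfe conditions."""
    lo, hi, t = 0.0, math.inf, 1.0
    g0, slope0 = g(x), dg(x) * d
    for _ in range(max_iter):
        if g(x + t * d) > g0 + c1 * t * slope0:    # sufficient decrease fails: shrink
            hi = t
            t = 0.5 * (lo + hi)
        elif dg(x + t * d) * d < c2 * slope0:      # curvature condition fails: grow
            lo = t
            t = 2.0 * lo if math.isinf(hi) else 0.5 * (lo + hi)
        else:
            return t
    return t

def gradient_descent(x0, eps, max_iter=10_000):
    """Return the first iterate with |g'(x)| <= eps."""
    x = x0
    for _ in range(max_iter):
        grad = dg(x)
        if abs(grad) <= eps:
            return x
        d = -grad
        x += wolfe_step(x, d) * d
    raise RuntimeError("stopping criterion not reached")

d_t = gradient_descent(3.0, eps=1e-8)
print(d_t)  # the exact minimizer of this g is ln(2)/3
```

The method needs no check of an event such as $G_t$: it only evaluates $g$ and its gradient, and the stopping rule is the same gradient bound as in (7.102).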


7.7 Helper theorems for proving the convergence properties of multi-phase minimization methods

Theorem 189. Let $U$ be an open ball with a center $b^*$ and $f : U \to \mathbb{R}$ be strongly convex
with a constant $s \in \mathbb{R}_+$. Let $f_n : U \to \mathbb{R}$, $n \in \mathbb{N}_+$, be twice differentiable and such that $\nabla^2 f_n \rightrightarrows \nabla^2 f$.
Then, for each $0 < m < s$, for a sufficiently large $n$, $f_n$ is strongly convex with a constant $m$. Let
further $b^*$ as above be the minimum point of $f$ and $\nabla f_n \rightrightarrows \nabla f$. Then, for a sufficiently large $n$,
$f_n$ possesses a unique minimum point $a_n$, which is equal to the unique point $x \in U$ for which
$\nabla f_n(x) = 0$, and each $b \in U$ is a $\frac{1}{2m}|\nabla f_n(b)|^2$-minimizer of $f_n$. Furthermore, $\lim_{n \to \infty} a_n = b^*$.

Proof. Let $0 < m < s$. From Lemma 80, for the sufficiently large $n$ for which
$\|\nabla^2 f_n(x) - \nabla^2 f(x)\|_\infty < s - m$, $x \in U$, we have $m_l(\nabla^2 f_n(x)) > m$, $x \in U$, so that $f_n$ is strongly convex with a
constant $m$. Under the additional assumptions as above, let $h_n = f_n + f(b^*) - f_n(b^*)$, $n \in \mathbb{N}_+$.
Then, $h_n(b^*) = f(b^*)$ and $\nabla h_n = \nabla f_n \rightrightarrows \nabla f$, so that $h_n \rightrightarrows f$. Furthermore, since $\nabla^2 h_n = \nabla^2 f_n$,
$n \in \mathbb{N}_+$, $h_n$ is strongly convex for a sufficiently large $n$. Thus, from Remark 176 and Theorem
178, for a sufficiently large $n$, $h_n$ and thus also $f_n$ possesses a unique minimum point $a_n$ and
$\lim_{n \to \infty} a_n = b^*$. The rest of the thesis follows from the discussion in Section 6.9. □

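Condition 190. A function $f : \mathbb{R}^l \to \mathbb{R}$ is continuous, functions $f_n : \mathbb{R}^l \to \mathbb{R}$, $n \in \mathbb{N}_+$, are such
that $f_n \overset{\mathrm{loc}}{\rightrightarrows} f$, and for a sequence $d_n \in \mathbb{R}^l$, $n \in \mathbb{N}_+$, we have

    $\lim_{n \to \infty} d_n = d^* \in \mathbb{R}^l$. (7.110)

We have the following easy-to-prove lemma.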

Lemma 191. If Condition 190 holds, then $\lim_{n \to \infty} f_n(d_n) = f(d^*)$.

Theorem 192. Assuming Condition 190, let for a bounded sequence $s_n \in \mathbb{R}^l$, $n \in \mathbb{N}_+$, it hold
$f_n(s_n) \le f_n(d_n)$, $n \in \mathbb{N}_+$. Then,

    $\limsup_{n \to \infty} f(s_n) \le f(d^*)$. (7.111)

Let further $d^* \in \mathbb{R}^l$ be the unique minimum point of $f$. Then,

    $\lim_{n \to \infty} f(s_n) = f(d^*)$. (7.112)

If further $f$ is convex, then

    $\lim_{n \to \infty} s_n = d^*$. (7.113)

Proof. Let $\epsilon > 0$. From the boundedness of the set $D = \{s_n : n \in \mathbb{N}_+\}$ and $f_n \overset{\mathrm{loc}}{\rightrightarrows} f$, let $N_1 \in \mathbb{N}_+$ be
such that for $n > N_1$, $|f_n(x) - f(x)| < \frac{\epsilon}{2}$, $x \in D$. From Lemma 191, let $N_2 > N_1$ be such that for
$n > N_2$, $|f_n(d_n) - f(d^*)| < \frac{\epsilon}{2}$. Then, for each $n > N_2$,

    $f(s_n) < f_n(s_n) + \frac{\epsilon}{2} \le f_n(d_n) + \frac{\epsilon}{2} < f(d^*) + \epsilon$, (7.114)

which proves (7.111). Let $d^*$ be the unique minimum point of $f$. Then, (7.112) follows from
$f(s_n) \ge f(d^*)$, $n \in \mathbb{N}_+$, and (7.111). Let now $f$ be convex and $\delta \in \mathbb{R}_+$. Then, from the continuity
of $f$, there exists $x_0 \in S_l(d^*, \delta)$ such that $f(x_0) = \inf_{x \in S_l(d^*, \delta)} f(x)$. From the uniqueness of $d^*$,
$m := f(x_0) - f(d^*) > 0$. From the convexity of $f$, for $x \in \mathbb{R}^l$ such that $|x - d^*| \ge \delta$ we have

    $f(x) - f(d^*) \ge \frac{|x - d^*|}{\delta}\Big(f\Big(d^* + \delta\frac{x - d^*}{|x - d^*|}\Big) - f(d^*)\Big) \ge m$. (7.115)

Thus, when (7.114) holds for $\epsilon < m$, then we must have $|s_n - d^*| < \delta$, which proves (7.113). □
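Lemma 191, used in the proof above, can be checked with a single triangle inequality. The following is a sketch of one possible argument, with the compact set $K$ introduced here only for this purpose:

```latex
% K is compact because (d_n) converges to d^*.
K := \{d_n : n \in \mathbb{N}_+\} \cup \{d^*\}, \qquad
|f_n(d_n) - f(d^*)|
  \le \sup_{x \in K} |f_n(x) - f(x)| + |f(d_n) - f(d^*)|
  \xrightarrow[n \to \infty]{} 0 .
```

The first term vanishes by the locally uniform convergence of $f_n$ to $f$ on the compact set $K$, and the second by the continuity of $f$ and (7.110).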

7.8 Two-phase minimization of estimators with gradient-based stopping criteria and constraints or function modifications

In this section we describe minimization methods of estimators in which two-phase
minimization can be used. In their first phase one can use some GM method as in Section 7.6
and in the second phase e.g. constrained minimization of the estimator considered or
unconstrained minimization of such a modified estimator, using gradient-based stopping criteria.
The single- and multi-stage versions of these methods shall be abbreviated as CGSSM and
CGMSM respectively. We also discuss applications of these methods to the minimization of
the new mean square estimators in the LETGS setting and the inefficiency constant estimators
in the ECM setting for C = 1.

Let us further in this section assume that $A = \mathbb{R}^l$ and that the following condition holds.


Condition 193. For some $\epsilon \in \mathbb{R}_+$, functions $g_1, g_2 : \mathbb{R}^l \to \mathcal{B}(\mathbb{R}^l)$ are such that for each $x \in \mathbb{R}^l$,
$g_1(x)$ is open,

    $B_l(x, \epsilon) \subset g_1(x)$, (7.116)

    $\overline{g_1(x)} \subset g_2(x)$, (7.117)

and for each bounded set $B \subset \mathbb{R}^l$, the set $\bigcup_{x \in B} g_2(x)$ is bounded.

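CGSSM and CGMSM will be defined as special cases of the following CGM method. In CGM we
assume that for some unbounded $T \subset \mathbb{R}_+$, for each $t \in T$ we are given a $[0, \infty]$-valued random
variable $\epsilon_t$, an $A$-valued random variable $d_t$, a function $\widetilde{d}_t : \Omega \to A$, and a random function
$f_t : \mathcal{S}(A) \otimes (\Omega, \mathcal{F}) \to \mathcal{S}(\mathbb{R})$, such that $b \mapsto f_t(b, \omega)$ is differentiable, $\omega \in \Omega$, we have

    $\widetilde{d}_t \in g_2(d_t)$, (7.118)

    $f_t(\widetilde{d}_t) \le f_t(d_t)$, (7.119)

and if $\widetilde{d}_t \in g_1(d_t)$, then

    $|\nabla_b f_t(\widetilde{d}_t)| \le \epsilon_t$. (7.120)

Furthermore, we assume that Condition 167 holds for the above variables $d_t$, $t \in T$, and some
$b^* \in A$.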

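Remark 194. Functions $\widetilde{d}_t$ as above always exist, assuming that the other variables as above
are given. Indeed, without loss of generality let $\epsilon_t = 0$. Then, if $\widetilde{d}_t$ fulfilling (7.118), (7.119), and
$\widetilde{d}_t \notin g_1(d_t)$ does not exist, then $\widetilde{d}_t$ can be chosen to be a minimum point of $b \in \overline{g_1(d_t)} \mapsto f_t(b)$,
which exists due to $\overline{g_1(d_t)}$ being compact (see Condition 193) and $f_t(b) > f_t(d_t)$,
$b \in \partial g_1(d_t) \subset g_2(d_t) \setminus g_1(d_t)$.

Consider some functions $\widehat{\mathrm{est}}_k$, $k \in \mathbb{N}_p$, as in (7.74) such that $b \mapsto \widehat{\mathrm{est}}_k(b', b)(\omega)$ is differentiable,
$b' \in \mathbb{R}^l$, $\omega \in \Omega_1^k$, $k \in \mathbb{N}_p$.

Definition 195. CGSSM is defined as CGM in which conditions 162 and 32 hold and
$f_t(b) = \mathbb{1}(N_t = k \in \mathbb{N}_p)\,\widehat{\mathrm{est}}_k(b', b)(\widetilde{\kappa}_k)$, $t \in T$. CGMSM is defined as CGM in which $T = \mathbb{N}_+$, Condition 134
holds, Condition 135 holds for $n_k \in \mathbb{N}_p$, $k \in \mathbb{N}_+$, and we have $f_k(b) = \widehat{\mathrm{est}}_{n_k}(b_{k-1}, b)(\widetilde{\chi}_k)$, $k \in \mathbb{N}_+$.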

The following condition is needed e.g. if we want to investigate the asymptotic properties of
$\widetilde{d}_t$, $t \in T$.

Condition 196. The functions $\widetilde{d}_t$, $t \in T$, are random variables (i.e. they are measurable
functions from $(\Omega, \mathcal{F})$ to $\mathcal{S}(A)$).

Remark 197. Whenever dealing with some set $D \subset \Omega$ for which it is not clear if $D \in \mathcal{F}$, when
trying to prove that $\mathbb{P}(D) = 1$ and in particular $D \in \mathcal{F}$, we shall implicitly assume that we are
working on a complete probability space, so that to achieve the goal it is sufficient to prove
that $\mathbb{P}(E) = 1$ for some $E \in \mathcal{F}$ such that $E \subset D$. Such a $D$ will further typically appear when
considering functions $g_t : \Omega \to \mathbb{R}^l$ like $\widetilde{d}_t$ as above, without assuming that they are random
variables. For instance, for some $b^* \in \mathbb{R}^l$ we will consider $D = \{\omega \in \Omega : \lim_{t \to \infty} g_t(\omega) = b^*\}$ or
$D = \{\omega \in \Omega : g_t(\omega) = b^* \text{ for a sufficiently large } t\}$.

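Condition 198. It holds $\epsilon_t(\omega) > 0$, $t \in T$, $\omega \in \Omega$, and we are given a function
$R : \mathcal{S}(A) \to \mathcal{S}([\epsilon, \infty))$ such that $\sup_{|x| \le M} R(x) < \infty$, $M \in \mathbb{R}_+$.

As the function $R$ in the above condition one can take e.g. $R(x) = a|x| + b$ for some $a \in (0, \infty)$
and $b \in [\epsilon, \infty)$.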

Remark 199. Let us assume Condition 198. Then, using e.g. boxes $g_1(x) = \{y \in \mathbb{R}^l : |x_i - y_i| < R(x),\ i = 1, \ldots, l\}$,
or balls $g_1(x) = B_l(x, R(x))$, and $g_2(x) = \overline{g_1(x)}$, $x \in A$, for each $t \in T$, under
some additional regularity assumptions on $b \mapsto f_t(b, \omega)$, $\omega \in \Omega$ (which in the case of CGSSM and
CGMSM reduce to appropriate such assumptions on $\widehat{\mathrm{est}}_k$, $k \in \mathbb{N}_p$), $\widetilde{d}_t$ as above can be a result
of some constrained minimization method of the respective $f_t(b)$, started at $d_t$, constrained
to $g_2(d_t)$, and stopped in the first point $\widetilde{d}_t$ in which the respective requirements for CGM as
above are fulfilled. See e.g. [41, 12, 13] for some examples of such constrained minimization
algorithms (also called minimization methods with bounds when box constraints are used).
In such a case (and assuming that the same minimization algorithm is used for each $\omega \in \Omega$),
Condition 196 typically holds and can be proved using the definition of the algorithm used.

Consider the following condition. 

Condition 200. It holds $\delta \in \mathbb{R}_+$ and $h : A \to [0, \infty)$ is a twice continuously differentiable function
such that $h(x) = 0$ for $x \in \overline{B}_l(0, 1)$ and $h(x) > 1$ for $|x| > 1 + \delta$.

An example of an easy to compute function fulfilling Condition 200 is $h(x) = 0$ for $|x| \le 1$ and
h(x) = for |x| > 1.
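One function satisfying Condition 200 can be sketched as follows. The cubic form $h(x) = ((|x| - 1)/\delta)^3$ for $|x| > 1$ is a hypothetical choice made here for illustration and is not necessarily the function intended above; it vanishes on the closed unit ball, is twice continuously differentiable because the cubic has zero first and second derivatives at $|x| = 1$, and exceeds 1 whenever $|x| > 1 + \delta$.

```python
import math

def h(x, delta):
    """Assumed penalty shape: 0 on the closed unit ball, cubic growth outside it."""
    r = math.sqrt(sum(c * c for c in x))  # Euclidean norm |x|
    if r <= 1.0:
        return 0.0
    return ((r - 1.0) / delta) ** 3

delta = 0.1
print(h((0.5, 0.5), delta))   # inside the unit ball
print(h((1.0, 0.0), delta))   # on the unit sphere
print(h((1.2, 0.0), delta))   # |x| > 1 + delta, so the value exceeds 1
```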

Remark 201. Let us assume conditions 198 and 200, and let $f_t$ be nonnegative, $t \in T$. Let
$g_1(x) = B_l(x, R(x))$ and $g_2(x) = \overline{B}_l(x, R(x)(1 + \delta))$, $x \in A$. Then, under some additional assumptions on
$b \mapsto f_t(b, \omega)$, $\omega \in \Omega$, rather than using constrained minimization as in Remark 199, to obtain
$\widetilde{d}_t$ in CGM one can use some globally convergent unconstrained minimization method for the
following modification of $f_t$

    $b \mapsto h_t(b) = f_t(b) + f_t(d_t)\,h\Big(\frac{b - d_t}{R(d_t)}\Big)$. (7.121)

Such a method could start at $d_t$ and stop in the first point $\widetilde{d}_t$ in which

    $h_t(\widetilde{d}_t) \le h_t(d_t)$ (7.122)

and if $\widetilde{d}_t \in g_1(d_t)$, then

    $|\nabla_b h_t(\widetilde{d}_t)| \le \epsilon_t$. (7.123)

Sufficient assumptions for the global convergence of a class of such minimization methods are
given by the Zoutendijk theorem, as discussed in Remark 188. Such assumptions are fulfilled in
the above case if we have twice continuous differentiability of $b \mapsto f_t(b, \omega)$, $\omega \in \Omega$, and $h$, which
is why we assumed the latter in Condition 200.

Let us check that the assumptions ofCGM are satisfied for such constructed dt. If ftidf) = 0, 
then from ft being nonnegative, itholdsVbftidt) = 0, and thus dt = dt and wehave (7.118). If 
fidt) >0, thenfrom (7.121) and Condition 200, htib) > hfidfl for\b- dt\ > il + 6)R{dt), and 
thus from (7.122) we also have (7.118). From (7.121) wehavehtldt) = ftidf and ffidt) < hfidfi, 
so that from (7.122) we have (7.119). Finally, ifdt £ giidfi, thenfrom (7.123) andSI b^fix) = 
VbftM, xe giidt), wehave (7.120). Similarly as in Remark 199, Condition 196 typically holds 
for such constructed dt. 

Consider the following condition, which will be useful for proving the asymptotic properties
of minimization results of CGSSM, CGMSM, and some further methods.

Condition 202. Almost surely for a sufficiently large t, (7.120) holds. 

The following theorem will be useful for proving the convergence properties of CGM methods.

Theorem 203. Let us assume that Condition 190 holds for $f_n$, $n \in \mathbb{N}_+$, and $f$ which is convex and has a unique minimum point $d^*$. Let $\tilde d_n \in g_2(d_n)$ be such that $f_n(\tilde d_n) \le f_n(d_n)$, $n \in \mathbb{N}_+$. Then,

$$\lim_{n\to\infty} \tilde d_n = d^* \qquad (7.124)$$

and

$$\lim_{n\to\infty} f_n(\tilde d_n) = f(d^*). \qquad (7.125)$$

Let further $f$ be twice continuously differentiable with a positive definite Hessian on $A$ and let $f_n$, $n \in \mathbb{N}_+$, be twice differentiable and such that their $i$th derivatives for $i = 1, 2$ converge locally uniformly to such derivatives of $f$. Let $\epsilon_n \ge 0$, $n \in \mathbb{N}_+$, be such that $\lim_{n\to\infty} \epsilon_n = 0$. Let for $n \in \mathbb{N}_+$ it hold that if $\tilde d_n \in g_1(d_n)$ then

$$|\nabla f_n(\tilde d_n)| \le \epsilon_n. \qquad (7.126)$$

Then, for a sufficiently large $n$, (7.126) holds. Let further $D$ be a bounded neighbourhood of $d^*$. Then, for a sufficiently large $n$, $f_n|_D$ has a unique minimum point, which is the unique point of $D$ at which $\nabla f_n$ vanishes.

Proof. From (7.110) and Condition 193, the set $\bigcup_{n\in\mathbb{N}_+} g_2(d_n)$ is bounded and so is the sequence $(\tilde d_n)_{n\in\mathbb{N}_+}$. Thus, (7.124) and (7.125) follow from Theorem 192. From (7.110) and (7.116), for a sufficiently large $n$, $\tilde d_n$ lies in a ball with center $d^*$ which is contained in $g_1(d_n)$, and thus (7.126) holds. The rest of the thesis follows from Theorem 189, in which from Lemma 179 as $U$ one can take any open ball with the center $d^*$, such that $D \subset U$, and as $f$ and $f_n$ in that theorem use restrictions to $U$ of the above $f$ and $f_n$. □

For CGMSM in the below examples we assume conditions 146 and 147. By saying that the
counterparts of conditions 167 or 168 hold for CGSSM or CGMSM or that Condition 165 holds
for CGMSM, we mean that these conditions hold for $d_t$ and $f_t$ replaced by $\tilde d_t$ and $\tilde f_t$, in the
counterpart of Condition 165 additionally assuming that Condition 196 holds.

Remark 204. Note that we have a counterpart of Remark 170 with $d_k$ replaced by $\tilde d_k$ and
conditions 167, 168, and 165 replaced by the counterparts of such conditions for CGMSM.

In the below CGSSM methods let us assume that Condition 115 holds for $S = Z^2$, while for
the CGMSM methods that Condition 151 holds for $S = Z^2$ (where when considering the ECM
setting we mean the counterparts of these conditions), in the LETGS setting additionally
assuming these conditions for $S = 1$.

Let us now consider CGSSM or CGMSM in the LETGS setting, assuming conditions 52 and 76,
that $b^*$ as above is a unique minimum point of msq, and that $\widetilde{\mathrm{est}}_n = \mathrm{msq2}_n$, $n \in \mathbb{N}_+$. Note that
in such a case the variables $d_t$ satisfying Condition 167 as assumed for CGM above can be e.g.
the results of GSSM or GMSM respectively for $\widehat{\mathrm{est}}_n = \mathrm{msq}_n$ as in Section 7.6. From conditions
167, 162, and the fourth point of Theorem 145 for CGSSM or the fifth point of Theorem 154 for
CGMSM, as well as from theorems 125, 203, and Remark 197, the counterparts of conditions
167 and 168 and Condition 202 hold for CGSSM and CGMSM.

Let us now consider CGSSM or CGMSM in the ECM setting for $\widetilde{\mathrm{est}}_n = \mathrm{ic}_n$. Let us assume
Condition 187 for $b^*$ as above and that $C = 1$ as discussed in Remark 41, so that ic = var and
$b^*$ is its unique minimum point. Note that in such a case the variables $d_t$ satisfying Condition
167 as above can be e.g. the results of GSSM or GMSM as in Section 7.6 respectively for
$\widehat{\mathrm{est}}_n = \mathrm{msq}_n$ or $\widehat{\mathrm{est}}_n = \mathrm{msq2}_n$, for $\widehat{\mathrm{est}}_n = \mathrm{msq}_n$ additionally assuming (7.106) as in that section.
From Condition 167, the fifth point of Theorem 145 for CGSSM or the sixth point of Theorem
154 for CGMSM, as well as from theorems 124 and 203, the counterparts of conditions 167 and
168 and Condition 202 hold for such a CGSSM and CGMSM.


7.9 Three-phase minimization of estimators with gradient-based
stopping criteria and function modifications

In this section we define minimization methods of estimators in which three-phase
minimization can be used. In their first phase one can perform some GM method as in Section
7.6, in the second a search of step lengths satisfying the Wolfe conditions can be carried 
out on a modification of the estimator considered, and in the third phase one can perform 
unconstrained minimization of the modified estimator using gradient-based stopping criteria. 
The single- and multi-stage versions of these methods shall be abbreviated as MGSSM and 
MGMSM respectively. We also discuss the possibility of the application of such methods to 
the minimization of the inefficiency constant estimators in the LETGS setting. 
MGSSM and MGMSM will be defined as special cases of the following MGM method. We assume in it that Condition 200 holds, $0 < \alpha_1 < \alpha_2 < 1$, and $A = \mathbb{R}^l$. Let $d^* \in A$ and for $T$ as in Section 7.4, let $d_t$, $t \in T$, be $A$-valued random variables such that a.s.

$$\lim_{t\to\infty} d_t = d^*. \qquad (7.127)$$

Let random functions $\tilde f_t : A \times \Omega \to [0,\infty)$, $t \in T$, be such that $b \to \tilde f_t(b,\omega)$ is continuously differentiable, $\omega \in \Omega$, $t \in T$. Let $\epsilon_t$, $t \in T$, be as in the previous section and such that additionally a.s.

$$\lim_{t\to\infty} \epsilon_t = 0. \qquad (7.128)$$

Let for each $t \in T$, $r_t$ be an $\mathbb{R}_+$-valued random variable,

$$h_t(b) = \tilde f_t(b) + \tilde f_t(d_t)\, h\!\left(\frac{b - d_t}{r_t}\right), \quad b \in A, \qquad (7.129)$$

and a function $\tilde d_t : \Omega \to A$ and an $A$-valued random variable $d_t'$ be such that

$$\tilde d_t, d_t' \in \overline{B}_l(d_t, r_t(1+\delta)). \qquad (7.130)$$

For each $t \in T$, let for some $[0,\infty)$-valued random variable $p_t$ it hold

$$d_t' - d_t = -p_t \nabla h_t(d_t), \qquad (7.131)$$

and let the following inequalities hold (which are the Wolfe conditions on the step length $p_t$ when considering the steepest descent search direction, see e.g. (3.6) in [41])

$$h_t(d_t') \le h_t(d_t) - p_t \alpha_1 |\nabla h_t(d_t)|^2, \qquad (7.132)$$

$$\nabla h_t(d_t') \nabla h_t(d_t) \le \alpha_2 |\nabla h_t(d_t)|^2. \qquad (7.133)$$

Finally, in MGM we assume that for each $t \in T$

$$h_t(\tilde d_t) \le h_t(d_t') \qquad (7.134)$$

and

$$|\nabla h_t(\tilde d_t)| \le \epsilon_t. \qquad (7.135)$$

Let $\widetilde{\mathrm{est}}_k$, $k \in \mathbb{N}_p$, as in (7.74) be such that $b \to \widetilde{\mathrm{est}}_k(b', b)(\omega) \in [0,\infty)$ is continuously differentiable, $b' \in \mathbb{R}^l$, $\omega \in \Omega_1^k$, $k \in \mathbb{N}_p$. We define MGSSM and MGMSM as special cases of MGM in the same way as CGSSM and CGMSM are defined as special cases of CGM in Definition 195.
Remark 205. For a given $t \in T$, assuming that the other variables and constants as above are given, a possible construction of $p_t$, $d_t'$, and $\tilde d_t$ in MGM is as follows. On the event $\nabla h_t(d_t) = 0$, let us set $d_t' = \tilde d_t = d_t$. Let further $\omega \in \Omega$ be such that

$$\nabla h_t(d_t(\omega),\omega) \ne 0. \qquad (7.136)$$

Then, to obtain $p_t(\omega)$ and thus also $d_t'(\omega)$, one can perform a line search of $h_t(\cdot,\omega)$ in the steepest descent direction $-\nabla h_t(d_t(\omega),\omega)$, started in $d_t(\omega)$ and stopped when $p_t(\omega)$ and the corresponding $d_t'(\omega)$ (see (7.131)) start to satisfy the Wolfe conditions (7.132) and (7.133) (evaluated on such an $\omega$). The line search can be performed e.g. using Algorithm 3.5 from [41]. If this algorithm is used for each $\omega$ as above, then such constructed $d_t'$ is a random variable. Let further the variable $d_t'$ as above be given. Consider now $\omega \in \Omega$ satisfying (7.136) and $\epsilon_t(\omega) > 0$. Then, under some additional assumptions on $\tilde f_t(\cdot,\omega)$, to construct $\tilde d_t(\omega)$ one can use some convergent unconstrained minimization algorithm of $h_t(\cdot,\omega)$ started in $d_t'(\omega)$ and stopped in the first point $\tilde d_t(\omega)$ in which (7.134) and (7.135) hold. See e.g. the assumptions of the Zoutendijk theorem in Remark 188. Note that from (7.136) and $h_t$ being nonnegative it holds $h_t(d_t(\omega),\omega) > 0$, and thus $h_t(x,\omega) > h_t(d_t(\omega),\omega)$ for $|x - d_t(\omega)| > r_t(\omega)(1+\delta)$, so that from $h_t(\tilde d_t(\omega),\omega) \le h_t(d_t'(\omega),\omega) \le h_t(d_t(\omega),\omega)$, (7.130) holds. If $\epsilon_t(\omega) > 0$ for each $\omega$ such that (7.136) holds, and the same unconstrained minimization algorithm is used for each such $\omega$, then from the definition of such an algorithm it typically follows that such constructed $\tilde d_t$ is a random variable. For $\omega \in \Omega$ such that we have (7.136) and $\epsilon_t(\omega) = 0$, $\tilde d_t(\omega)$ can be e.g. some (global) minimum point of $h_t(\cdot,\omega)$.
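
The line search step of the construction described in the remark above can be sketched as follows. This is a minimal sketch and not Algorithm 3.5 from [41]: it swaps in a simple bisection and doubling search for a step length satisfying the two Wolfe inequalities (7.132) and (7.133) along the steepest descent direction. The function names `wolfe_ok` and `steepest_descent_step`, the default constants `a1` and `a2` (standing for $\alpha_1$ and $\alpha_2$), and the iteration cap are assumptions made for illustration.

```python
import math


def wolfe_ok(h, grad, d, p, a1, a2):
    """Check the two Wolfe inequalities for the step d - p * grad(d)."""
    g = grad(d)
    g2 = sum(v * v for v in g)
    d_new = [x - p * v for x, v in zip(d, g)]
    decrease = h(d_new) <= h(d) - p * a1 * g2
    curvature = sum(u * v for u, v in zip(grad(d_new), g)) <= a2 * g2
    return decrease, curvature


def steepest_descent_step(h, grad, d, a1=1e-4, a2=0.9, max_iter=100):
    """Find a step length p satisfying both inequalities by bisection/doubling."""
    lo, hi, p = 0.0, math.inf, 1.0
    for _ in range(max_iter):
        decrease, curvature = wolfe_ok(h, grad, d, p, a1, a2)
        if not decrease:
            # Step too long: shrink towards the lower bound.
            hi = p
            p = 0.5 * (lo + hi)
        elif not curvature:
            # Step too short: grow, or bisect once an upper bound is known.
            lo = p
            p = 2.0 * p if math.isinf(hi) else 0.5 * (lo + hi)
        else:
            return p
    raise RuntimeError("no step length satisfying the Wolfe conditions was found")
```

A call such as `steepest_descent_step(h, grad, d)` returns a step length `p`, from which the intermediate point is obtained as `d - p * grad(d)`, mirroring (7.131). The third phase, minimizing further until the gradient norm falls below a tolerance, is not shown.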

For $x \in \mathbb{R}^l$ and $B \subset \mathbb{R}^l$ let us denote

$$d(x,B) = \inf_{y\in B} |x - y|. \qquad (7.137)$$

The following theorems will be useful for proving the convergence properties of MGM methods.

Theorem 206. Let $K \subset \mathbb{R}^l$ be nonempty and compact and let $g_n : K \to \mathbb{R}$, $n \in \mathbb{N}_+$, converge uniformly to a continuous function $g : K \to \mathbb{R}$. Let $m$ be the minimum of $g$ and $B$ be its set of minimum points. Then, for each sequence of points $d_n \in K$, $n \in \mathbb{N}_+$, such that $\lim_{n\to\infty} g_n(d_n) = m$, we have $\lim_{n\to\infty} d(d_n, B) = 0$.

Proof. Let $\epsilon \in \mathbb{R}_+$. From the continuity of $x \in \mathbb{R}^l \to d(x,B)$, $K_2 := \{x \in K : d(x,B) \ge \epsilon\}$ is a closed subset of $K$ and thus it is compact. From $g$ being continuous, it attains its infimum $w := \inf_{x\in K_2} g(x)$ on $K_2$, and thus we must have $\delta := w - m > 0$. For sufficiently large $n$ for which $|g_n(x) - g(x)| < \frac{\delta}{2}$, $x \in K$, and $|g_n(d_n) - m| < \frac{\delta}{2}$, we have $|g(d_n) - m| \le |g(d_n) - g_n(d_n)| + |g_n(d_n) - m| < \delta$ and thus $d_n \notin K_2$, i.e. $d(d_n, B) < \epsilon$. □

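Theorem 207. Let $f : \mathbb{R}^l \to \mathbb{R}$ and $f_n : \mathbb{R}^l \to \mathbb{R}$, $n \in \mathbb{N}_+$, be continuously differentiable and such that $f_n \overset{loc}{\rightrightarrows} f$ and $\nabla f_n \overset{loc}{\rightrightarrows} \nabla f$. Let further for some $d^* \in \mathbb{R}^l$, $s \in \mathbb{R}_+$, and $0 < w < r < \infty$ it hold

$$f(b) \ge f(d^*) + s, \quad b \in \mathbb{R}^l,\ |b - d^*| \ge w, \qquad (7.138)$$

and let $r_n \in \mathbb{R}_+$, $n \in \mathbb{N}_+$, be such that $\lim_{n\to\infty} r_n = r$. Let for a sequence $d_n \in \mathbb{R}^l$, $n \in \mathbb{N}_+$, it hold

$$\lim_{n\to\infty} d_n = d^*. \qquad (7.139)$$

Let for each $n \in \mathbb{N}_+$, $h_n : \mathbb{R}^l \to \mathbb{R}$ be such that

$$h_n(b) = f_n(b) + f_n(d_n)\, h\!\left(\frac{b - d_n}{r_n}\right), \quad b \in \mathbb{R}^l. \qquad (7.140)$$

Let $\epsilon_n \ge 0$, $n \in \mathbb{N}_+$, be such that $\lim_{n\to\infty} \epsilon_n = 0$, and let for each $n \in \mathbb{N}_+$, for some $p_n \in [0,\infty)$, points $d_n', \tilde d_n \in \overline{B}_l(d_n, (1+\delta) r_n)$ be such that

$$d_n' - d_n = -p_n \nabla f_n(d_n), \qquad (7.141)$$

$$h_n(d_n') \le f_n(d_n) - p_n \alpha_1 |\nabla f_n(d_n)|^2, \qquad (7.142)$$

$$\nabla h_n(d_n') \nabla f_n(d_n) \le \alpha_2 |\nabla f_n(d_n)|^2, \qquad (7.143)$$

$$h_n(\tilde d_n) \le h_n(d_n'), \qquad (7.144)$$

and

$$|\nabla h_n(\tilde d_n)| \le \epsilon_n. \qquad (7.145)$$

Then, for a sufficiently large $n$ we have

$$d_n', \tilde d_n \in B_l(d^*, w) \qquad (7.146)$$

and

$$|\nabla f_n(\tilde d_n)| \le \epsilon_n. \qquad (7.147)$$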

Let further

$$\phi(u) = \nabla f(d^* - u \nabla f(d^*)) \nabla f(d^*) - \alpha_2 |\nabla f(d^*)|^2, \quad u \in \mathbb{R}, \qquad (7.148)$$

and $v = \inf\{u > 0 : \phi(u) = 0\}$. Then, if $|\nabla f(d^*)| = 0$ then $v = 0$ and if $|\nabla f(d^*)| \ne 0$ then $v \in (0, w)$. Furthermore, for

$$\mu = f(d^*) - v \alpha_1 |\nabla f(d^*)|^2 \qquad (7.149)$$

we have

$$\limsup_{n\to\infty} f(d_n') \le \mu, \qquad (7.150)$$

$$\limsup_{n\to\infty} f(\tilde d_n) \le \mu, \qquad (7.151)$$

for $E = \{x \in \mathbb{R}^l : f(x) \le \mu\}$, we have

$$\lim_{n\to\infty} d(d_n', E) = 0, \qquad (7.152)$$

and for the set $D = \{x \in \mathbb{R}^l : \nabla f(x) = 0,\ f(x) \le \mu\} \subset E$, containing the nonempty set of minimum points of $f$, we have

$$\lim_{n\to\infty} d(\tilde d_n, D) = 0. \qquad (7.153)$$


Proof. Let $K = \overline{B}_l(d^*, w)$. Let $N_1 \in \mathbb{N}_+$ be such that for $n \ge N_1$, $|d_n - d^*| < \frac{r - w}{2}$ and $r - r_n < \frac{r - w}{2}$, in which case for $x \in K$ we have

$$|x - d_n| \le |d^* - d_n| + |x - d^*| < \frac{r-w}{2} + w < \frac{r-w}{2} + w + \left(\frac{r-w}{2} - (r - r_n)\right) = r_n, \qquad (7.154)$$

so that

$$K \subset B_l(d_n, r_n). \qquad (7.155)$$

From the set $F := \bigcup_{n=1}^{\infty} \overline{B}_l(d_n, (1+\delta) r_n)$ being bounded, let $N_2 \ge N_1$ be such that for $n \ge N_2$

$$|f_n(x) - f(x)| < \frac{s}{2}, \quad x \in F. \qquad (7.156)$$

From Lemma 191,

$$\lim_{n\to\infty} f_n(d_n) = f(d^*), \qquad (7.157)$$

and thus let $N_3 \ge N_2$ be such that for $n \ge N_3$

$$|f_n(d_n) - f(d^*)| < \frac{s}{2}. \qquad (7.158)$$

Then, for $n \ge N_3$ and $x \in F$ such that $|x - d^*| \ge w$, we have

$$h_n(x) \ge f_n(x) > f(x) - \frac{s}{2} \ge f(d^*) + \frac{s}{2} > f_n(d_n), \qquad (7.159)$$

where in the first inequality we used (7.140) and Condition 200, in the second (7.156), in the third (7.138), and in the last (7.158). Thus, since from (7.142) and (7.144),

$$h_n(\tilde d_n) \le h_n(d_n') \le f_n(d_n), \qquad (7.160)$$

for $n \ge N_3$ we have (7.146). For $n \ge N_3$, from (7.155) and (7.146), we have $h_n(d_n') = f_n(d_n')$, $\nabla h_n(d_n') = \nabla f_n(d_n')$, and similarly for $d_n'$ replaced by $\tilde d_n$, so that from (7.145), (7.147) holds, and from (7.160),

$$f_n(\tilde d_n) \le f_n(d_n') \le f_n(d_n). \qquad (7.161)$$

If $\nabla f(d^*) = 0$, then $v = 0$, in which case (7.150) and (7.151) follow from (7.161), the sequences $(d_n')_{n\in\mathbb{N}_+}$ and $(\tilde d_n)_{n\in\mathbb{N}_+}$ being bounded, and Theorem 192. Let now $\nabla f(d^*) \ne 0$. Then, $\phi(0) = (1 - \alpha_2)|\nabla f(d^*)|^2 > 0$ and thus from the continuity of $\phi$, $v > 0$. The fact that $v < w$ follows from (7.138) and Lemma 3.1 in [41] about the existence of steps $u > 0$ satisfying the Wolfe conditions: $\phi(u) \le 0$ and $f(d^* - \nabla f(d^*)u) \le f(d^*) - u\alpha_1|\nabla f(d^*)|^2$. Let $0 < v' < v$. Then, from the continuity of $\phi$,

$$\inf_{0 \le u \le v'} \phi(u) > 0. \qquad (7.162)$$

For $n \in \mathbb{N}_+$, and $u \in \mathbb{R}$, let $\phi_n(u) = \nabla f_n(d_n - \nabla f_n(d_n)u)\nabla f(d^*) - \alpha_2|\nabla f(d^*)|^2$. Since from Lemma 191

$$\lim_{n\to\infty} \nabla f_n(d_n) = \nabla f(d^*), \qquad (7.163)$$

the function $u \to d_n - u\nabla f_n(d_n)$ converges uniformly to $u \to d^* - u\nabla f(d^*)$ on $[0, v']$, and thus from Theorem 144, $\phi_n$ converges to $\phi$ uniformly on $[0, v']$. Thus, from (7.162), let $N_4 \ge N_3$ be such that for $n \ge N_4$, $\inf_{u\in[0,v']} \phi_n(u) > 0$. For such an $n$, from (7.141) and (7.143) it must hold $p_n > v'$ and from (7.142) we have

$$f_n(d_n') \le f_n(d_n) - v'\alpha_1|\nabla f_n(d_n)|^2, \qquad (7.164)$$

and thus

$$f(d_n') \le f(d_n') - f_n(d_n') + f_n(d_n) - v'\alpha_1|\nabla f_n(d_n)|^2. \qquad (7.165)$$

From (7.146) and the fact that $f_n$ converges to $f$ uniformly on $K$, we have $\lim_{n\to\infty}(f(d_n') - f_n(d_n')) = 0$. Thus, from (7.165), (7.157), and (7.163),

$$\limsup_{n\to\infty} f(d_n') \le f(d^*) - v'\alpha_1|\nabla f(d^*)|^2. \qquad (7.166)$$

Since this holds for each $v' < v$, we have (7.150), and from (7.161), we also have (7.151). Due to (7.138), $f$ attains a minimum. For each minimum point $x_0$ of $f$ we have $\nabla f(x_0) = 0$ and from (7.150) and $f(x_0) \le f(d_n')$, $n \in \mathbb{N}_+$, we have $f(x_0) \le \mu$. Thus, $x_0 \in D$. The minimum of $g := (f \vee \mu)|_K$ is equal to $\mu$ and $E \subset K$ is its set of minimum points. From (7.150), we have $\lim_{n\to\infty} f(d_n') \vee \mu = \mu$. Thus, from (7.146) and Theorem 206 for such a $g$ and $g_n = g$, $n \in \mathbb{N}_+$, we receive (7.152). The minimum of $g := (|\nabla f| + f \vee \mu)|_K$ is $\mu$ and its set of minimum points is $D \subset K$. From $\epsilon_n \to 0$, (7.147), and (7.151), $\lim_{n\to\infty}(|\nabla f_n(\tilde d_n)| + f(\tilde d_n) \vee \mu) = \mu$. Thus, from (7.146) and Theorem 206 for such a $g$ and $g_n = (|\nabla f_n| + f \vee \mu)|_K$, $n \in \mathbb{N}_+$, we receive (7.153). □


Let us now discuss how MGSSM and MGMSM can be applied in the LETGS setting for $\widetilde{\mathrm{est}}_n = \mathrm{ic}_n$, $n \in \mathbb{N}_p$, for $p = 2$. We assume conditions 76, 126, and Condition 115 for $S = Z^2$. Then, from Theorem 125, msq has a unique minimum point $d^*$. The variables $d_t$, $t \in T$, such that (7.127) holds a.s. for such a $d^*$, can be obtained e.g. using GSSM or GMSM methods respectively for $\widehat{\mathrm{est}}_n = \mathrm{msq}_n$ as in Section 7.6. Furthermore, for a positive definite matrix $M$ and its lowest eigenvalue $m > 0$ as in Theorem 125, we have from (6.73) that

$$\mathrm{var}(d^* + b) \ge \mathrm{var}(d^*) + \frac{m}{2}|b|^2 \qquad (7.167)$$

and thus

$$\mathrm{ic}(d^* + b) \ge C_{min}\left(\mathrm{var}(d^*) + \frac{m}{2}|b|^2\right), \quad b \in \mathbb{R}^l. \qquad (7.168)$$

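For some $\sigma_1, \sigma_2 \in \mathbb{R}_+$, $\sigma_1 < \sigma_2$, let us define

$$r = \sqrt{\frac{2}{m}\left(\frac{\mathrm{ic}(d^*)}{C_{min}} - \mathrm{var}(d^*)\right) + \sigma_2} \qquad (7.169)$$

and

$$w = \sqrt{\frac{2}{m}\left(\frac{\mathrm{ic}(d^*)}{C_{min}} - \mathrm{var}(d^*)\right) + \sigma_1}. \qquad (7.170)$$

It holds $r > w > 0$ and from (7.168), for $b \in \mathbb{R}^l$, $|b| \ge w$,

$$\mathrm{ic}(d^* + b) \ge C_{min}\left(\mathrm{var}(d^*) + \frac{m w^2}{2}\right) = \mathrm{ic}(d^*) + \frac{m}{2} C_{min}\sigma_1, \qquad (7.171)$$

so that we have (7.138) for $f = \mathrm{ic}$ and $s = \frac{m}{2} C_{min}\sigma_1$.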

Let us assume that Condition 115 holds for $S = C$ (in addition to this condition holding for
$S = Z^2$ as assumed above), so that from the fourth point of Theorem 123, ic is smooth. Let $\mu$,
$E$, and $D$ be as in Theorem 207 for $f = \mathrm{ic}$ and $d^*$ as above. Note that we have $\mu < \mathrm{ic}(d^*)$ only if
$\nabla \mathrm{ic}(d^*) \ne 0$, which from Remark 129 holds only if $\mathrm{var}(d^*) \ne 0$ and $\nabla c(d^*) \ne 0$. Let for $n \in \mathbb{N}_+$
and $b \in \mathbb{R}^l$


Mnib) = 


2L(fi)GZ2exp(-i ^ \rii\^) 

^ i=i 


(7.172) 


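and let $\hat m_n(b) = m_l(M_n(b))$ for $m_l$ as in Section 6.3, i.e. $\hat m_n(b)$ is the lowest eigenvalue of $M_n(b)$. For $n \in \mathbb{N}_p$, and $b, d \in \mathbb{R}^l$ let us define $\hat r_n(b, d) : \Omega_1^n \to \mathbb{R}$ to be such that for $\omega \in \Omega_1^n$ for which $\hat m_n(b)(\omega) > 0$ and $\mathrm{ic}_n(b, d)(\omega) - C_{min}\mathrm{var}_n(b, d)(\omega) > 0$

$$\hat r_n(b,d)(\omega) = \sqrt{\frac{2}{\hat m_n(b)(\omega)}\left(\frac{\mathrm{ic}_n(b,d)(\omega)}{C_{min}} - \mathrm{var}_n(b,d)(\omega)\right) + \sigma_2}, \qquad (7.173)$$

and otherwise $\hat r_n(b, d)(\omega) = a$ for some $a \in \mathbb{R}_+$.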

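Let us now focus on MGSSM, for which let us assume Condition 115 for $S = 1$ and that $r_t = \mathbb{1}(N_t = k \in \mathbb{N}_p)\,\hat r_k(b', d_t)(\tilde\kappa_k)$, $t \in T$. Then, from Theorem 145, a.s. $b \to \mathrm{var}_n(b', b)(\tilde\kappa_n) \overset{loc}{\rightrightarrows} \mathrm{var}$ and $b \to \mathrm{ic}_n(b', b)(\tilde\kappa_n) \overset{loc}{\rightrightarrows} \mathrm{ic}$. Thus, from Lemma 191 and Condition 162, we have a.s. $\mathbb{1}(N_t = k \in \mathbb{N}_p)\,\mathrm{var}_k(b', d_t)(\tilde\kappa_k) \to \mathrm{var}(d^*)$ and $\mathbb{1}(N_t = k \in \mathbb{N}_p)\,\mathrm{ic}_k(b', d_t)(\tilde\kappa_k) \to \mathrm{ic}(d^*)$. Furthermore, from the SLLN, a.s. $M_n(b')(\tilde\kappa_n) \to M$ and thus from Lemma 80, $\hat m_n(b')(\tilde\kappa_n) \to m$. Therefore, a.s. $\lim_{t\to\infty} r_t = r$. Thus, from Theorem 207 and Remark 197 we receive that the following condition holds for MGSSM.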

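Condition 208. Condition 202 holds, a.s. $\limsup_{t\to\infty} \mathrm{ic}(d_t') \le \mu$, $\limsup_{t\to\infty} \mathrm{ic}(\tilde d_t) \le \mu$, $\lim_{t\to\infty} d(d_t', E) = 0$, and

$$\lim_{t\to\infty} d(\tilde d_t, D) = 0. \qquad (7.174)$$

Furthermore, a.s. for a sufficiently large $t$, $d_t', \tilde d_t \in B_l(d^*, w)$.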

For MGMSM let us assume that conditions 146 and 147 hold, that Condition 151 holds for $S = Z^2$, $S = C$, and $S = 1$, and that $r_k = \hat r_{n_k}(b_{k-1}, d_k)(\tilde\chi_k)$, $k \in \mathbb{N}_+$. From Theorem 154, a.s. $b \to \mathrm{var}_{n_k}(b_{k-1}, b)(\tilde\chi_k) \overset{loc}{\rightrightarrows} \mathrm{var}$ and $b \to \mathrm{ic}_{n_k}(b_{k-1}, b)(\tilde\chi_k) \overset{loc}{\rightrightarrows} \mathrm{ic}$. From Hölder's inequality and Theorem 120 it easily follows that Condition 151 holds for $S$ equal to the different entries of $M_1(0)$. Thus, from the first point of Theorem 154 a.s. $M_{n_k}(b_{k-1})(\tilde\chi_k) \to M$, and thus $\hat m_{n_k}(b_{k-1})(\tilde\chi_k) \to m$. Therefore, we have a.s. $\lim_{k\to\infty} r_k = r$. Thus, from Theorem 207 and Remark 197 it follows that Condition 208 holds for MGMSM.

Theorem 209. Let functions var, c, and ic be as in Section 4.1 for $A$ open, let var be lower semicontinuous and convex and have a unique minimum point $b^* \in A$, and let ic be continuous in $b^*$. Let for some $d_n \in A$, $n \in \mathbb{N}_+$,

$$\limsup_{n\to\infty} \mathrm{ic}(d_n) < \mathrm{ic}(b^*). \qquad (7.175)$$

Then,

$$\liminf_{n\to\infty} \mathrm{var}(d_n) > \mathrm{var}(b^*) \qquad (7.176)$$

and

$$\limsup_{n\to\infty} c(d_n) < c(b^*). \qquad (7.177)$$

Proof. For some $\mathrm{ic}(b^*) > s > \limsup_{n\to\infty} \mathrm{ic}(d_n)$, let $\epsilon \in \mathbb{R}_+$ be such that $\mathrm{ic}(b) > s$ for $b \in B_l(b^*, \epsilon) \subset A$. Then, $d_n \in A \setminus B_l(b^*, \epsilon)$ for a sufficiently large $n$. From the semicontinuity of var, for some $b_0 \in S_l(b^*, \epsilon)$, $\mathrm{var}(b_0) = \min_{b \in S_l(b^*, \epsilon)} \mathrm{var}(b) > \mathrm{var}(b^*)$, and thus from the convexity of var it holds $\mathrm{var}(b) \ge \mathrm{var}(b_0)$ for $b \in A \setminus B_l(b^*, \epsilon)$, and we have (7.176). Note that from (7.175), $\mathrm{ic}(b^*) > 0$ and thus $\mathrm{var}(b^*) > 0$. Therefore,

$$\limsup_{n\to\infty} c(d_n) \le \frac{\limsup_{n\to\infty} \mathrm{ic}(d_n)}{\liminf_{n\to\infty} \mathrm{var}(d_n)} < \frac{\mathrm{ic}(b^*)}{\mathrm{var}(b^*)} = c(b^*). \qquad (7.178)$$

□


If $\nabla \mathrm{ic}(d^*) \ne 0$, so that $\mu < \mathrm{ic}(d^*)$, then for $c_t = \tilde d_t$ or $c_t = d_t'$ as above for which we have a.s. $\limsup_{t\to\infty} \mathrm{ic}(c_t) \le \mu$, from Theorem 209 it also holds a.s. $\liminf_{t\to\infty} \mathrm{var}(c_t) > \mathrm{var}(d^*)$ and $\limsup_{t\to\infty} c(c_t) < c(d^*)$.

Condition 210. $D = \{b^*\}$ for some $b^* \in \mathbb{R}^l$.

Remark 211. Note that Condition 210 holds under the assumptions as above e.g. if $C$ is a positive constant or if $\mathrm{var}(d^*) = 0$, and in both these cases $b^* = d^*$.

Remark 212. Let us assume Condition 210. Then, b* is the unique minimum point of ic as 
above. Furthermore, under the above assumptions for MGSSM and MGMSM, from (7.174) and 
Lemma 191, counterparts of conditions 167 and 168 hold in these methods (by which we mean 
the same as above Remark 204). 


Note that Remark 204 applies also to MGMSM. 


7.10 Comparing the first-order asymptotic efficiency of minimization methods

Let $A \in \mathcal{B}(\mathbb{R}^l)$ be nonempty and $T \subset \mathbb{R}_+$ be unbounded. Consider a measurable function $\phi : A \to \mathbb{R}$
and an $A$-valued stochastic process $d = (d_t)_{t\in T}$. For $t \in T$, $d_t$ can be an adaptive random
parameter trying to minimize $\phi$ for $t$ being e.g. the simulation budget, the total number of
steps in SSM methods, or the number of stages or simulations in MSM methods, used to
compute $d_t$. We describe some such possibilities in more detail in the below remark.

Remark 213. In the various SSM methods as in the previous sections, for some $T$ as in Condition 162, we can consider $d_t$ equal to $d_t$, $\tilde d_t$, or $d_t'$ as in these methods, $t \in T$ (see Remark 163). Let us further consider the case of the various MSM methods as in the previous sections. Then, for variables $p_k$ equal to $d_k$, $\tilde d_k$, or $d_k'$ as in these methods, $k \in \mathbb{N}_+$, for some $\mathbb{N} \cup \{\infty\}$-valued random variables $N_t$, $t \in T$, and some $A$-valued random variables $p_0$ and $p_\infty$, one can set

$$d_t = p_{N_t}\mathbb{1}(N_t \ne \infty) + p_\infty \mathbb{1}(N_t = \infty), \quad t \in T. \qquad (7.179)$$

The simplest choice would be to take $T = \mathbb{N}_+$ and $N_k = k$, so that $d_k = p_k$, $k \in \mathbb{N}_+$, i.e. $k$ is the number of stages of MSM in which $d_k$ is computed. If we want $t \in T$ to correspond to the number of samples generated to compute $d_t$, then for $s_k := \sum_{i=1}^{k} n_i$, $k \in \mathbb{N}_+$, and $T = \{s_k, k \in \mathbb{N}_+\}$, we can take $N_{s_k} = k$, $k \in \mathbb{N}_+$. Alternatively, we can take $T = \mathbb{R}_+$ and for each $t \in T$, $N_t$ can be the smallest number of stages using the simulation budget $t$, or the highest such number before we exceed that budget. Let us discuss how one can model this. For some $[0,\infty)$-valued random variables $M_i$ modelling the costs of the minimization algorithms in the $i$th stage of MSM (we can set $M_i = 0$ if we do not want to consider them), $i \in \mathbb{N}_+$, and $U$ being a theoretical cost variable analogous as of a step of SSM in Remark 163, under Condition 135 we can take e.g.

$$N_t = \inf\Big\{k \in \mathbb{N} : \sum_{i=1}^{k}\Big(M_i + \sum_{j=1}^{n_i} U(\chi_{i,j})\Big) \ge t\Big\} \qquad (7.180)$$

or

$$N_t = \sup\Big\{k \in \mathbb{N} : \sum_{i=1}^{k}\Big(M_i + \sum_{j=1}^{n_i} U(\chi_{i,j})\Big) \le t\Big\}. \qquad (7.181)$$

Note that if we have (7.179) and a.s. (2.7) and (2.10) and one of the following holds: a.s. $p_k \to b^*$, for some $f : A \to \mathbb{R}$ a.s. $f(p_k) \to f(b^*)$, or for some $m \in \mathbb{R}$, a.s. $\limsup_{k\to\infty} f(p_k) \le m$, then we have respectively that a.s. $d_t \to b^*$ (compare with Remark 2), $f(d_t) \to f(b^*)$, or $\limsup_{t\to\infty} f(d_t) \le m$. For $N_t$ as in (7.180) or (7.181), (2.7) holds a.s. if $U < \infty$, $Q(b)$ a.s., $b \in A$.
Furthermore, (2.10) holds a.s. if Condition 127 holds for C = U, or from Theorem 138, if for 
someK e SSiA] such that inf> 0, Condition 134 holds for Ki = K, i e N+, and the 
assumptions of Theorem 139 hold for g[v,x] = Uix), ve A, xeLli. 
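
The two budget-based stage counts in (7.180) and (7.181) can be illustrated with a short Python sketch. It assumes that the total cost of each stage (the minimization cost plus the summed simulation costs of that stage) has already been computed and is available as a finite list; the function names and the truncation to the stages supplied are assumptions made for illustration, and $\mathbb{N}$ is taken to include 0 with the empty sum equal to 0.

```python
import math
from itertools import accumulate


def stages_to_reach_budget(stage_costs, t):
    """Smallest k with cumulative cost >= t (as in (7.180)); inf if never reached."""
    if t <= 0:
        return 0
    for k, total in enumerate(accumulate(stage_costs), start=1):
        if total >= t:
            return k
    return math.inf


def stages_within_budget(stage_costs, t):
    """Largest k with cumulative cost <= t (as in (7.181)) among the given stages."""
    k_best = 0
    for k, total in enumerate(accumulate(stage_costs), start=1):
        if total > t:
            break  # costs are nonnegative, so later totals cannot fit either
        k_best = k
    return k_best
```

For stage costs `[3, 4, 5]` and budget `8`, the first function counts the stage during which the budget is reached, while the second counts only the stages completed without exceeding it.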

For each $\mathbb{R}$-valued stochastic process $b = (b_t)_{t\in T}$, let us denote $\sigma_-(b) = \sup\{x \in \mathbb{R} : \lim_{t\to\infty} P(b_t > x) = 1\}$ and $\sigma_+(b) = \inf\{x \in \mathbb{R} : \lim_{t\to\infty} P(b_t < x) = 1\} = -\sigma_-(-b)$. Note that $\sigma_-(b) \le \sigma_+(b)$ and $\sigma_-(b) = \sigma_+(b) = x \in \mathbb{R}$ only if $b_t \overset{p}{\to} x$. For $b'$ analogous as $b$ we have $\sigma_-(b' - b) \ge \sigma_-(b') - \sigma_+(b)$. In particular, for each $\delta \in \mathbb{R}$ such that $\sigma_-(b') - \sigma_+(b) > \delta$, we have $\lim_{t\to\infty} P(b_t' - b_t > \delta) = 1$, and such a $\delta$ can be chosen positive if $\sigma_-(b') > \sigma_+(b)$.

For $d = (d_t)_{t\in T}$ as above, let us denote $\phi(d) = (\phi(d_t))_{t\in T}$. Let $d'$ be analogous as $d$. We shall call $d$ asymptotically not less efficient than $d'$ for the minimization of $\phi$ if $\sigma_+(\phi(d)) \le \sigma_-(\phi(d'))$, and (asymptotically) more efficient for this purpose if this inequality is strict. If $\phi(d_t)$ and $\phi(d_t')$ both converge in probability to the same real number, then $d$ and $d'$ shall be called equally efficient. We call such defined relations the first-order asymptotic efficiency relations, to distinguish them from such second-order relations which will be defined in Section 8.3.

For instance, for some $d = (d_t)_{t\in T}$ as above, which can be some parameters corresponding to the single- or multi-stage minimization of some mean square estimators as in the above remark, and $d^*$ being the unique minimum point of mean square, let it hold a.s. $d_t \to d^*$, and thus assuming further that ic is continuous in $d^*$, also a.s. $\mathrm{ic}(d_t) \to \mathrm{ic}(d^*)$. Let further for $d' = (d_t')_{t\in T}$, which can be some parameters corresponding to the minimization of the inefficiency constant estimators as in the above remark, it hold for $\mu$ as in Section 7.9 (for which $\mu \le \mathrm{ic}(d^*)$ and if $\nabla \mathrm{ic}(d^*) \ne 0$, then $\mu < \mathrm{ic}(d^*)$), that a.s. $\limsup_{t\to\infty} \mathrm{ic}(d_t') \le \mu$. Then, $d'$ is asymptotically not less efficient for the minimization of ic than $d$ and more efficient if $\nabla \mathrm{ic}(d^*) \ne 0$.

Let now some $d = (d_t)_{t\in T}$ as above, which can be some parameters corresponding to the single- or multi-stage minimization of the cross-entropy estimators as in the above remark, fulfill a.s. $d_t \to p^*$ for $p^*$ being the unique minimum point of the cross-entropy. Let further $d' = (d_t')_{t\in T}$, which can correspond to the minimization of mean square or inefficiency constant estimators for $C = 1$, fulfill a.s. $d_t' \to b^*$ for $b^*$ being the unique minimum point of msq, which is continuous in $b^*$ and convex on the set on which it is finite. Then, $d'$ is asymptotically not less efficient than $d$ for the minimization of msq, and more efficient if $b^* \ne p^*$.

7.11 Finding exactly a zero- or optimal-variance IS parameter 

In this section we describe situations in which a.s. for a sufficiently large t, the minimization 
results dt of our new estimators are equal to a zero- or optimal-variance IS parameter b* 
as in Definition 26. When proving that this holds in the below examples we shall impose 
an assumption that we can find the minimum or critical points of these estimators exactly. 
Even though such an assumption is unrealistic when minimization is performed using some 
iterative numerical methods, it is a frequent idealisation used to simplify the convergence 
analysis of stochastic counterpart methods, see e.g. [30, 53, 32]. Note also that such an 
assumption is fulfilled for linearly parametrized control variates (at least when the numerical 
errors occurring when solving a system of linear equations are ignored) when a zero-variance 
control variates parameter exists (see e.g. Chapter 5, Section 3 in [5]), in which case the below 
theory can be easily applied to prove that a.s. such a parameter is found by the method after 
sufficiently many steps. 

For a nonempty set $A \in \mathcal{B}(\mathbb{R}^l)$, consider a measurable function $h : A \times \Omega_1 \to \mathbb{R}$. We will be most
interested in the cases corresponding to the below two conditions.

Condition 214. Condition 18 holds and $h(b,\omega) = (ZL(b))(\omega)$, $\omega \in \Omega_1$, $b \in A$.

Condition 215. Condition 18 holds and $h(b,\omega) = (|Z|L(b))(\omega)$, $\omega \in \Omega_1$, $b \in A$.

Let $b^* \in A$ and consider a family of distributions as in Section 3.4, satisfying Condition 22.

Condition 216. For some $\beta \in \mathbb{R}$, $Q_1$ a.s. (and thus also $Q(b)$ a.s. for each $b \in A$)

$$h(b^*, \cdot) = \beta. \qquad (7.182)$$

Condition 217. Condition 214 holds and $b^*$ is a zero-variance IS parameter.

Condition 218. Condition 215 holds and $b^*$ is an optimal-variance IS parameter.

Remark 219. From the discussion in Section 3.2, under Condition 217, Condition 216 holds for $\beta = \alpha = E_{Q_1}(Z)$, and under Condition 218, for $\beta = E_{Q_1}(|Z|)$.

Remark 220. If conditions 32 and 216 hold, then a.s.

$$h(b^*, \kappa_i) = \beta, \quad i \in \mathbb{N}_+. \qquad (7.183)$$

Lemma 221. If conditions 135 and 216 hold, then a.s.

$$h(b^*, \chi_{k,i}) = \beta, \quad i \in \{1,\dots,n_k\},\ k \in \mathbb{N}_+. \qquad (7.184)$$

Proof. For each $k \in \mathbb{N}_+$ and $i \in \{1,\dots,n_k\}$

$$P(h(b^*, \chi_{k,i}) = \beta) = E(Q(b_{k-1})(h(b^*, \cdot) = \beta)) = 1, \qquad (7.185)$$

so that (7.184) holds a.s. as a conjunction of a countable number of conditions holding with probability one. □

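Let $D \in \mathcal{B}(A)$ be such that $b^* \in D$. For each $n \in \mathbb{N}_+$ and $\omega \in \Omega_1^n$, let us define

$$B_n(\omega) = \{b \in D : h(b,\omega_i) = h(b,\omega_j) \in \mathbb{R},\ i,j \in \{1,\dots,n\}\}. \qquad (7.186)$$

For some $p \in \mathbb{N}_+$ and each $n \in \mathbb{N}_p$, let $\widehat{\mathrm{est}}_n$ be as in (7.74). Consider the following condition.

Condition 222. For each $n \in \mathbb{N}_p$ and $\omega \in \Omega_1^n$, if the set $B_n(\omega)$ is nonempty, then for each $d \in A$, $B_n(\omega)$ is a subset of the set of the minimum points of $b \in D \to \widehat{\mathrm{est}}_n(d, b)(\omega)$.

Note that if Condition 222 holds, then it holds also for $D$ replaced by its arbitrary subset (where $D$ is replaced also in (7.186)).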

Remark 223. Let us assume Condition 23 and letD = A. It holds for b',be A 

rhi52„(h',fi)=4 E —^-^{\Zi\Li{b)-\ZfLj{b)f + (f^)n)'^. (7.187) 

^ ^i^b)Lj(b) 

Thus, under Condition 215, Condition 222 is satisfied for p = \ and esfn equal to msq2„ 
or msq2„ (for the latter see (7.107) and (7.108)), or for p = 2 and ^t„ equal to vaf„ (which 
is positively linearly equivalent to msq2„ in the function ofb as discussed in Section 4.2). 
Furthermore, under Condition 214, Condition 222 is satisfied for p = 2 and ^t„ = ic„ (see 
formulas (4.23) and (4.29)) orp = 3 and ^t„ = ic2„ (see (4.31)). 

Lemma 224. If conditions 32 and 216 hold, then a.s. for each $k \in \mathbb{N}_p$, $b^* \in B_k(\tilde\kappa_k)$. If further Condition 222 holds then a.s. for each $k \in \mathbb{N}_p$ and $d \in A$, $b^*$ is a minimum point of $b \in D \to \widehat{\mathrm{est}}_k(d, b)(\tilde\kappa_k)$.

Proof. It follows from Remark 220 and (7.186). □

Condition 225. Conditions 32 and 162 hold and functions $d_t$ on $\Omega$, $t \in T$, are such that a.s. for a sufficiently large $t$, $b \in D \to \widehat{\mathrm{est}}_{N_t}(b', b)(\tilde\kappa_{N_t})$ has a unique minimum point equal to $d_t$.

Theorem 226. If conditions 216, 222, and 225 hold, then a.s. for a sufficiently large $t$, $d_t = b^*$.

Proof. It follows from Lemma 224. □
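
To make the role of the sets in (7.186) concrete, the following Python sketch uses a toy Gaussian mean-shift family that is assumed here purely for illustration and is not taken from the text: $Q_1 = \mathcal{N}(0,1)$, $Q(b) = \mathcal{N}(b,1)$, $L(b)(x) = \exp(-bx + b^2/2)$, and $Z(x) = \exp(\theta x)$. In this toy family $Z L(b)$ equals $\exp((\theta - b)x + b^2/2)$, which is constant in $x$ exactly when $b = \theta$, so $b^* = \theta$ plays the role of a zero-variance parameter. The names `h`, `spread`, and `in_B_n` are hypothetical.

```python
import math
import random

THETA = 0.7  # assumed tilt of the toy integrand Z(x) = exp(THETA * x)


def h(b, x, theta=THETA):
    """Z(x) * L(b)(x) for the toy Gaussian mean-shift family."""
    return math.exp((theta - b) * x + 0.5 * b * b)


def spread(b, xs):
    """Difference between the largest and smallest value of h(b, .) on the sample."""
    values = [h(b, x) for x in xs]
    return max(values) - min(values)


def in_B_n(b, xs):
    """True if h(b, .) takes a single value on the sample, as required in (7.186)."""
    return spread(b, xs) == 0.0
```

For any sample `xs`, including one drawn under a sampling parameter different from `THETA`, `in_B_n(THETA, xs)` holds, whereas a generic `b` fails the test as soon as the sample contains two distinct points. This mirrors why an estimator whose minimum points contain this set recovers $b^*$ exactly once its minimizer is unique.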

Lemma 227. If Condition 135 holds for $n_k \in \mathbb{N}_p$, $k \in \mathbb{N}_+$, and Condition 216 holds, then a.s. for each $k \in \mathbb{N}_+$, $b^* \in B_{n_k}(\tilde\chi_k)$. If further Condition 222 holds, then a.s. for each $k \in \mathbb{N}_+$ and $d \in A$, $b^*$ is a minimum point of $b \in D \to \widehat{\mathrm{est}}_{n_k}(d, b)(\tilde\chi_k)$.

Proof. It follows from Lemma 221 and (7.186). □

Condition 228. Condition 135 holds for $n_k \in \mathbb{N}_p$, $k \in \mathbb{N}_+$, and functions $d_k$ on $\Omega$, $k \in \mathbb{N}_+$, are such that a.s. for a sufficiently large $k$, $b \in D \to \widehat{\mathrm{est}}_{n_k}(b_{k-1}, b)(\tilde\chi_k)$ has a unique minimum point equal to $d_k$.

Theorem 229. If conditions 216, 222, and 228 hold, then a.s. for a sufficiently large $k$, $d_k = b^*$.

Proof. It follows from Lemma 227. □

Let us consider the ECM setting and assume Condition 187 and that b* is an optimal-variance 
IS parameter. Let us take ^t„ = msq2„ as in (7.107), and D = A. As discussed in Section 7.6 
for ESSM (that is for GSSM for Cf = 0, t e T), conditions 166 and 167 hold. Thus, a.s. for a 
sufficiently large t for which Xiv, e and dt & A, b e D ^ msqZpjfb’,b)[kN,) has a unique 

minimum point equal to dt, i.e. Condition 225 holds. Thus, from remarks 219, 223, and 
Theorem 226, a.s. for a sufficiently large t e T, dt = b*. For EMSM, under the assumptions 
as in Section 7.6 which ensure that Condition 166 holds. Condition 228 holds and thus from 
remarks 219, 223, and Theorem 229, a.s. for a sufficiently large keM+, dk = b*. 

Let us now consider CGSSM and CGMSM in the LETGS setting for $\widetilde{\mathrm{est}}_n = \mathrm{msq2}_n$ and $b^*$ being an optimal-variance IS parameter, or in the ECM setting for $\widetilde{\mathrm{est}}_n = \mathrm{ic}_n$, $C = 1$, and $b^*$ being a zero-variance IS parameter. Let $D$ be a bounded neighbourhood of $b^*$. Let us consider the corresponding assumptions as in Section 7.8 for $\epsilon_t = 0$, $t \in T$, for the appropriate $T$ as in that section. Then, from Theorem 203 we receive that for $d_t = \tilde d_t$ and $\widehat{\mathrm{est}}_n = \widetilde{\mathrm{est}}_n$, Condition 225 holds for CGSSM and Condition 228 holds for CGMSM. Thus, from remarks 219, 223, and theorems 226 and 229, a.s. for a sufficiently large $t \in T$, $\tilde d_t = b^*$ in CGSSM and CGMSM.

Let now $b^*$ be a zero-variance IS parameter and consider the MGSSM and MGMSM methods in the LETGS setting for $\widetilde{\mathrm{est}}_n = \mathrm{ic}_n$, $n \in \mathbb{N}_2$, and $\epsilon_t = 0$, $t \in T$, under the assumptions as in Section 7.9. Then, from Remark 130, $\nabla^2 \mathrm{ic}(b^*)$ is positive definite. Thus, from the continuity of $\nabla^2 \mathrm{ic}$ and from Lemma 80, ic is strongly convex on some open ball $U$ with center $b^*$. Therefore, from theorems 207 and 189, as well as remarks 211 and 212, conditions 225 and 228 hold for such a MGSSM and MGMSM respectively for $d_t = \tilde d_t$, $\widehat{\mathrm{est}}_n = \mathrm{ic}_n$, and $D \subset U$ being some neighbourhood of $b^*$. Therefore, by similar arguments as above, a.s. for a sufficiently large $t$, $\tilde d_t = b^*$ in MGSSM and MGMSM.
8 Asymptotic properties of minimization methods

8.1 Helper theory for proving the asymptotic properties of minimization results

For some $l \in \mathbb{N}_+$, let $A \subset \mathbb{R}^l$ be open and nonempty, $f : A \to \mathbb{R}$ be twice continuously differentiable, $b^* \in A$, and $H = \nabla^2 f(b^*)$.

Condition 230. $\nabla f(b^*) = 0$ and $H$ is positive definite.

Condition 230 is implied e.g. by the following one.

Condition 231. $H$ is positive definite and $b^*$ is the unique minimum point of $f$.

Remark 232. Let us assume Condition 230. Then, from Lemma 80 and the continuity of $\nabla^2 f$, for an open or closed ball $B \subset A$ with center $b^*$ and sufficiently small positive radius, $f$ is strongly convex on $B$ and from the discussion in Section 6.9, $b^*$ is the unique minimum point of $f|_B$.

Let $T \subset \mathbb{R}_+$ be unbounded. Consider measurable functions $f_t : A \times \Omega \to \mathbb{R}$, $t \in T$, such that $b \to f_t(b,\omega)$ is twice continuously differentiable, $t \in T$, $\omega \in \Omega$. We shall denote $f_t(b) = f_t(b, \cdot)$ and $\nabla^i f_t(b) = \nabla^i_b f_t(b, \cdot)$, $i = 1, 2$.

Condition 233. For some neighbourhood $D \in \mathcal{B}(A)$ of $b^*$, $\nabla^2 f_t$ converges to $\nabla^2 f$ on $D$ uniformly in probability (as $t \to \infty$), i.e. $\sup_{b\in D} \|\nabla^2 f_t(b) - \nabla^2 f(b)\|_\infty \overset{p}{\to} 0$.

Let dt, teT,he A-valued random variables. 

Condition 234. It holds dt^b*. 

For gf e R+, t e T, we shall write Xt = Opigt) if 0 (as Z ^ oo). Let rt e R+, Z e T, be such 
that limr^oo rt = oo. 

Condition 235. For some nonnegative random variables 61 , te T, such that5t = Opirf^) and 
for some neighbourhood U e SSi A) of b*, for the event At thatdt isa5t-minimizerofft\u (in 
particular, dt e U, see Section 6.9), we have 

lim P(Af) = 1. (8.1) 

t^OO 
Condition 236. For some $\mathbb{R}^l$-valued random variable $Y$, $\sqrt{r_t} \nabla f_t(b^*) \Rightarrow Y$.

The below theorem is a consequence of Theorem 2.1 and the discussion of its assumptions in 
sections 2 and 4 in [51], of the implicit function theorem, and of the below Remark 239 (see 
also formula (4.14) and its discussion in [52]). 

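Theorem 237. Under conditions 230, 233, 234, 235, and 236, we have

$$\sqrt{r_t}(d_t - b^*) \Rightarrow -H^{-1} Y. \tag{8.2}$$

In particular, if $Y \sim \mathcal{N}(0, \Sigma)$ for some covariance matrix $\Sigma \in \mathbb{R}^{l \times l}$, then

$$\sqrt{r_t}(d_t - b^*) \Rightarrow \mathcal{N}(0, H^{-1} \Sigma H^{-1}). \tag{8.3}$$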

We will need the following trivial remark. 

Remark 238. Note that if for random variables $a_t$ and $\widetilde{a}_t$, $t \in T$, with probability tending to
one (as $t \to \infty$) we have $a_t = \widetilde{a}_t$, then for each $g : T \to \mathbb{R}$, $g(t)(a_t - \widetilde{a}_t) \overset{p}{\to} 0$.

Remark 239. Theorem 2.1 in [51] uses $T = \mathbb{N}_+$ and $r_n = n$, $n \in T$, but its proof for the general
$T$ and $r_t$, $t \in T$, as above is analogous. Let us assume the conditions mentioned in Theorem
237. Let from Remark 232, $B$ be a closed ball contained in the set $D$ as in Condition 233 and $U$
as in Condition 235, and such that $f$ is strongly convex on $B$ and $b^*$ is the unique minimum
point of $f|_B$. From the generalization of Theorem 2.1 in [51] to the general $T$ and $r_t$ as above
and the discussion of assumptions of this theorem in [51], one easily receives the thesis (8.2)
of Theorem 237 under the additional assumptions that we have $d_t \in B$, $t \in T$, and Condition
235 holds with $A_t = \Omega$, $t \in T$. From Remark 238 and Condition 234, to prove Theorem 237 it is
sufficient to prove (8.2) with $d_t$ replaced by $\widetilde{d}_t = \mathbb{1}(d_t \in B) d_t + \mathbb{1}(d_t \notin B) b^*$. For $C_t = A_t \cap \{d_t \in B\}$,
let $\widetilde{\delta}_t(\omega) = \delta_t(\omega)$, $\omega \in C_t$, and $\widetilde{\delta}_t(\omega) = \infty$, $\omega \in \Omega \setminus C_t$. Then, from Remark 238, Condition 234,
and (8.1), for $d_t$ replaced by $\widetilde{d}_t$ and $\delta_t$ by $\widetilde{\delta}_t$, the conditions of Theorem 237 are satisfied and the
above additional assumptions hold. Thus, (8.2) with $d_t$ replaced by $\widetilde{d}_t$ follows from Theorem
2.1 in [51] as discussed above.

Condition 240. On some neighbourhood $K \in \mathcal{B}(A)$ of $b^*$, for $i = 1, 2$, the $i$th derivatives of
$b \to f_t(b)$ (i.e. $\nabla f_t$ and $\nabla^2 f_t$) converge to such derivatives of $f$ uniformly in probability.

Condition 240 is implied e.g. by the following one.

Condition 241. On some neighbourhood $K \in \mathcal{B}(A)$ of $b^*$, for $i = 1, 2$, a.s. the $i$th derivatives of
$b \to f_t(b)$ converge uniformly to such derivatives of $f$.

Condition 242. It holds $|\nabla f_t(d_t)| = o_p(r_t^{-1/2})$.

Lemma 243. If conditions 230, 234, 240, and 242 hold, then Condition 235 holds. 

Proof. From Remark 232, let $U$ be an open ball with center $b^*$, contained in $K$ as in Condition
240, and such that $f$ is strongly convex on $U$ with a constant $s \in \mathbb{R}_+$. Let $m \in (0, s)$. Then, from
Theorem 189, there exists $\epsilon \in \mathbb{R}_+$ such that for each twice differentiable function $g : U \to \mathbb{R}$ for
which

$$\sup_{x \in U} \left( \|\nabla^2 f(x) - \nabla^2 g(x)\|_\infty + |\nabla f(x) - \nabla g(x)| \right) < \epsilon, \tag{8.4}$$

each $b \in U$ is a $\frac{1}{2m} |\nabla g(b)|^2$-minimizer of $g$. Thus, Condition 235 for the above $U$ and
$\delta_t = \frac{1}{2m} |\nabla f_t(d_t)|^2$ follows from conditions 234, 240, and 242. □

From the above lemma we receive the following remark. 

Remark 244. The assumption that conditions 233 and 235 hold in Theorem 237 can be replaced 
by the assumption that conditions 240 and 242 hold. 

Consider the following composite condition. 

Condition 245. Conditions 230, 234, 240, and 242 hold. 

Theorem 246. Let us assume Condition 245 and that

$$\lim_{t \to \infty} P(\nabla f_t(b^*) = 0) = 1. \tag{8.5}$$

Then,

$$\sqrt{r_t}(d_t - b^*) \overset{p}{\to} 0. \tag{8.6}$$

Proof. From (8.5), Condition 236 holds for $Y = 0$, so that the thesis follows from Remark 244
and Theorem 237. □

Condition 247. For some nonnegative random variables $\delta_t$, $t \in T$, such that $\delta_t = o_p(r_t^{-1/2})$, with
probability tending to one (as $t \to \infty$) we have

$$|\nabla f_t(d_t)| \le \delta_t. \tag{8.7}$$

Lemma 248. Conditions 242 and 247 are equivalent.

Proof. If Condition 242 holds, then Condition 247 holds for $\delta_t = |\nabla f_t(d_t)|$. Let us assume
Condition 247. Then, for $\widetilde{\delta}_t$ equal to $\delta_t$ if (8.7) holds and $\infty$ otherwise, we have $|\nabla f_t(d_t)| \le \widetilde{\delta}_t$
and from Remark 238, $\widetilde{\delta}_t = o_p(r_t^{-1/2})$, from which Condition 242 follows. □

8.2 Asymptotic properties of functions of minimization results 

Let us consider $T$ and $r_t$, $t \in T$, as in the previous section. We shall further need the following
theorem on the delta method (see e.g. Theorem 3.1 and Section 3.3 in [55]).

Theorem 249. Let $m, n \in \mathbb{N}_+$, let $D \in \mathcal{B}(\mathbb{R}^m)$ be a neighbourhood of $\theta \in \mathbb{R}^m$, and consider a
function $\phi : \mathcal{S}(D) \to \mathcal{S}(\mathbb{R}^n)$. Let $Y_t$, $t \in T$, and $Y$ be $D$-valued random variables such that we
have $\sqrt{r_t}(Y_t - \theta) \Rightarrow Y$ (as $t \to \infty$). If $\phi$ is differentiable in $\theta$ with a differential $\phi'(\theta)$ (which we
identify with its matrix), then

$$\sqrt{r_t}(\phi(Y_t) - \phi(\theta)) \Rightarrow \phi'(\theta) Y. \tag{8.8}$$

If $\phi$ is twice differentiable in $\theta$ with the second differential $\phi''(\theta)$ and we have $\phi'(\theta) = 0$, then

$$r_t(\phi(Y_t) - \phi(\theta)) \Rightarrow \frac{1}{2} \phi''(\theta)(Y, Y). \tag{8.9}$$

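Remark 250. For $m \in \mathbb{N}_+$, let $\chi^2(m)$ denote the $\chi$-squared distribution with $m$ degrees of
freedom. Let $m \in \mathbb{N}_+$, $S \sim \mathcal{N}(0, I_m)$, $B \in \mathrm{Sym}_m(\mathbb{R})$, and $X = S^T B S$. Then, $X$ has a special case of
the generalized $\chi$-squared distribution, which we shall denote as $\widetilde{\chi}^2(B)$. For $B$ being a diagonal
matrix $B = \mathrm{diag}(v)$ for some $v \in \mathbb{R}^m$, $\widetilde{\chi}^2(B)$ will be also denoted as $\widetilde{\chi}^2(v)$. It holds $\mathbb{E}(X) = \mathrm{Tr}(B)$.
If $B = w I_m$ for some $w \in \mathbb{R}$, then we have $X \sim w \chi^2(m)$ (by which we mean that $X \sim wY$ for
$Y \sim \chi^2(m)$). Let $\mathrm{eig}(B) \in \mathbb{R}^m$ denote a vector of eigenvalues of $B$ and let $\lambda = \mathrm{eig}(B)$. Consider an
orthogonal matrix $U \in \mathbb{R}^{m \times m}$ such that $B = U \mathrm{diag}(\lambda) U^T$. Then, for $W = U^T S \sim \mathcal{N}(0, I_m)$ we
have

$$X = W^T \mathrm{diag}(\lambda) W = \sum_{i=1}^m \lambda_i W_i^2, \tag{8.10}$$

and thus $\widetilde{\chi}^2(B) = \widetilde{\chi}^2(\lambda)$. Let $\Lambda = \{0\} \cup \{v \in (\mathbb{R} \setminus \{0\})^k : k \in \mathbb{N}_+, v_1 \le v_2 \le \ldots \le v_k\}$, i.e. $\Lambda$ is the set of
all real-valued vectors in different dimensions with ordered nonzero coordinates or having only
one zero coordinate. Let for $v \in \mathbb{R}^m$, $\mathrm{ord}(v) \in \Lambda$ be equal to $0 \in \mathbb{R}$ if $v = 0$ and otherwise result
from ordering the coordinates of $v$ in nondecreasing order and removing the zero coordinates.
Then, we have $\widetilde{\chi}^2(v) = \widetilde{\chi}^2(\mathrm{ord}(v))$. For $Y \sim \chi^2(1)$ we have a moment-generating function
$M_Y(t) := \mathbb{E}(\exp(tY)) = (1 - 2t)^{-1/2}$, $t < \frac{1}{2}$. Thus, for each $k \in \mathbb{N}_+$, $v \in \Lambda \cap \mathbb{R}^k$, $Y \sim \widetilde{\chi}^2(v)$, and $t \in \mathbb{R}$
such that $1 - 2 v_i t > 0$, $i = 1, \ldots, k$, we have $M_Y(t) = \prod_{i=1}^k (1 - 2 v_i t)^{-1/2}$. Such an $M_Y$ is defined on
some neighbourhood of $0$ and it is a different function for different $v \in \Lambda$. Thus, for $v_1, v_2 \in \Lambda$
such that $v_1 \ne v_2$, we have $\widetilde{\chi}^2(v_1) \ne \widetilde{\chi}^2(v_2)$. It follows that for two real symmetric matrices $B_1$,
$B_2$, we have $\widetilde{\chi}^2(B_1) = \widetilde{\chi}^2(B_2)$ only if $\mathrm{ord}(\mathrm{eig}(B_1)) = \mathrm{ord}(\mathrm{eig}(B_2))$.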
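The diagonalization step in Remark 250 can be checked numerically. The following is a minimal sketch for a symmetric 2x2 matrix $B$, using only the Python standard library; the function names and the example matrix are illustrative assumptions, not part of the original text. It compares the quadratic form $S^T B S$ with the weighted sum of squares from (8.10) and estimates $\mathbb{E}(X)$ by simulation, which should be close to $\mathrm{Tr}(B)$.

```python
import math
import random


def eig_sym_2x2(b11, b12, b22):
    """Eigenvalues and rotation (c, s) diagonalizing [[b11, b12], [b12, b22]]."""
    # Jacobi rotation angle for a symmetric 2x2 matrix.
    angle = 0.5 * math.atan2(2.0 * b12, b11 - b22)
    c, s = math.cos(angle), math.sin(angle)
    lam1 = c * c * b11 + 2.0 * c * s * b12 + s * s * b22
    lam2 = s * s * b11 - 2.0 * c * s * b12 + c * c * b22
    return (lam1, lam2), (c, s)


def quadratic_form(b11, b12, b22, s1, s2):
    """X = S^T B S for S = (s1, s2)."""
    return b11 * s1 * s1 + 2.0 * b12 * s1 * s2 + b22 * s2 * s2


def weighted_squares(b11, b12, b22, s1, s2):
    """Sum of lambda_i * W_i^2 with W = U^T S, as on the right of (8.10)."""
    (lam1, lam2), (c, s) = eig_sym_2x2(b11, b12, b22)
    w1 = c * s1 + s * s2    # first column of U is (c, s)
    w2 = -s * s1 + c * s2   # second column of U is (-s, c)
    return lam1 * w1 * w1 + lam2 * w2 * w2


def sample_mean_of_x(b11, b12, b22, n, seed=0):
    """Monte Carlo estimate of E(S^T B S) for S ~ N(0, I_2)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        total += quadratic_form(b11, b12, b22, rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
    return total / n
```

For instance, for $B$ with entries $b_{11} = 2$, $b_{12} = 0.5$, $b_{22} = -1$ one has $\mathrm{Tr}(B) = 1$, and `sample_mean_of_x(2.0, 0.5, -1.0, 20000)` should return a value near 1 up to Monte Carlo error.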

Remark 251. Using notations as in Theorem 249, let $Y \sim \mathcal{N}(0, M)$ for some covariance matrix
$M \in \mathbb{R}^{m \times m}$. Then, (8.8) implies that

$$\sqrt{r_t}(\phi(Y_t) - \phi(\theta)) \Rightarrow \mathcal{N}(0, \phi'(\theta) M \phi'(\theta)^T). \tag{8.11}$$

Let further $n = 1$. Then, (8.9) is equivalent to

$$r_t(\phi(Y_t) - \phi(\theta)) \Rightarrow R := \frac{1}{2} Y^T \nabla^2 \phi(\theta) Y. \tag{8.12}$$

For $S \sim \mathcal{N}(0, I_m)$ we have $Y \sim M^{1/2} S$. Thus, from Remark 250, for

$$B = \frac{1}{2} M^{1/2} \nabla^2 \phi(\theta) M^{1/2} \tag{8.13}$$

we have $R \sim \widetilde{\chi}^2(B)$ and

$$\mathbb{E}(R) = \mathrm{Tr}(B) = \frac{1}{2} \mathrm{Tr}(\nabla^2 \phi(\theta) M). \tag{8.14}$$

Note that if $\theta$ is a local minimum point of $\phi$, then $\nabla^2 \phi(\theta)$ is positive semidefinite and so is $B$.
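As a hedged illustration of the second-order delta method and of formula (8.14), consider the one-dimensional toy case $\phi(y) = y^2$, $\theta = 0$, with $Y_t$ the mean of $t$ independent standard normal draws, so that $M = 1$ and $\nabla^2 \phi(\theta) = 2$. This example and all names in it are assumptions made for illustration only. The sketch below estimates the mean of $t(\phi(Y_t) - \phi(\theta))$ by simulation and compares it with $\frac{1}{2} \mathrm{Tr}(\nabla^2 \phi(\theta) M)$.

```python
import random


def scaled_excess(t, rng):
    """Return t * (phi(Y_t) - phi(0)) for phi(y) = y**2 and Y_t a mean of t N(0, 1) draws."""
    y_t = sum(rng.gauss(0.0, 1.0) for _ in range(t)) / t
    return t * (y_t ** 2)


def mean_scaled_excess(t=50, reps=4000, seed=1):
    """Monte Carlo estimate of E(t * (phi(Y_t) - phi(0)))."""
    rng = random.Random(seed)
    return sum(scaled_excess(t, rng) for _ in range(reps)) / reps


def limit_mean_1d(hessian, m):
    """One-dimensional version of (8.14): 0.5 * phi''(theta) * M."""
    return 0.5 * hessian * m
```

Here `limit_mean_1d(2.0, 1.0)` gives the limiting mean 1, and `mean_scaled_excess()` should be close to it up to Monte Carlo error.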

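Remark 252. If we have assumptions as in Theorem 237 leading to (8.3), then from Remark
251, for $M = H^{-1} \Sigma H^{-1}$ and

$$B = \frac{1}{2} M^{1/2} H M^{1/2} = \frac{1}{2} H^{-1/2} \Sigma H^{-1/2}, \tag{8.15}$$

we have

$$r_t(f(d_t) - f(b^*)) \Rightarrow \widetilde{\chi}^2(B). \tag{8.16}$$

Note that for $R \sim \widetilde{\chi}^2(B)$ it holds

$$\mathbb{E}(R) = \mathrm{Tr}(B) = \frac{1}{2} \mathrm{Tr}(\Sigma H^{-1}). \tag{8.17}$$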
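The mean in (8.17) is straightforward to evaluate once $H$ and $\Sigma$ are known or estimated. The following is a minimal sketch for the two-dimensional case, written with plain tuples and the standard library only; the helper names and the example matrices are hypothetical and serve only to illustrate the formula $\frac{1}{2} \mathrm{Tr}(\Sigma H^{-1})$.

```python
def inverse_2x2(h):
    """Inverse of a 2x2 matrix given as ((a, b), (c, d))."""
    (a, b), (c, d) = h
    det = a * d - b * c
    if det == 0:
        raise ValueError("H must be invertible")
    return ((d / det, -b / det), (-c / det, a / det))


def trace_of_product(x, y):
    """Tr(X Y) for 2x2 matrices X and Y."""
    return sum(x[i][j] * y[j][i] for i in range(2) for j in range(2))


def second_order_mean(sigma, h):
    """Mean of the limiting distribution as in (8.17): 0.5 * Tr(Sigma H^{-1})."""
    return 0.5 * trace_of_product(sigma, inverse_2x2(h))
```

For example, with $H$ having rows $(2, 1)$ and $(1, 2)$ and $\Sigma = I_2$, `second_order_mean(((1, 0), (0, 1)), ((2, 1), (1, 2)))` evaluates $\frac{1}{2} \mathrm{Tr}(H^{-1}) = \frac{2}{3}$. The function is linear in $\Sigma$, so multiplying $\Sigma$ by a factor $s$ multiplies the result by $s$.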

8.3 Comparing the second-order asymptotic efficiency of minimization methods

For some $T$ and $r_t$, $t \in T$, as in Section 8.1, let $\phi$ and $d = (d_t)_{t \in T}$ be as in Section 7.10 and such
that for some probability $\mu$ on $\mathbb{R}$ and $y \in \mathbb{R}$ we have

$$r_t(\phi(d_t) - y) \Rightarrow \mu. \tag{8.18}$$

Let further for some analogous $d' = (d'_t)_{t \in T}$ and $\mu'$ it hold $r_t(\phi(d'_t) - y) \Rightarrow \mu'$. Then, $\phi(d_t) \overset{p}{\to} y$
and similarly in the primed case, so that the processes $d$ and $d'$ are equivalent from the
point of view of the first-order asymptotic efficiency for the minimization of $\phi$ as discussed
in Section 7.10. Their second-order asymptotic efficiency for this purpose can be compared
by comparing the asymptotic distributions $\mu$ and $\mu'$. For instance, if $\mu = \mu'$, then they can
be considered equally efficient. If for each $x \in \mathbb{R}$, $\mu((-\infty, x]) \ge \mu'((-\infty, x])$, then it is natural
to consider the unprimed process to be not less efficient and more efficient if further for
some $x$ this inequality is strict. The second-order asymptotic efficiency as above can be also
compared using some moments like means or some quantiles like medians of the asymptotic
distributions, where the process corresponding to lower such parameter can be considered
more efficient. For $\mu = \widetilde{\chi}^2(B)$ and $\mu' = \widetilde{\chi}^2(B')$ for some symmetric matrices $B$ and $B'$, which
can arise e.g. from situations like in remarks 251 or 252, it may be convenient to compare the
second-order efficiency of the corresponding processes using the means $\mathrm{Tr}(B)$ and $\mathrm{Tr}(B')$ of
these distributions. In situations like in remarks 251 or 252, such means can be alternatively
expressed by formula (8.14) or (8.17) respectively, using which in some cases they can be
estimated or even computed analytically (see e.g. Section 8.9). For a number of stochastic
optimization methods from the literature we do not have formulas like (8.18) and some other
ways of comparing the second-order asymptotic efficiency of such methods are needed; see
[54] for some ideas.

Remark 253. Under the assumptions as above, let $\mu([0, \infty)) = 1$ and let for some $X \sim \mu$ and
$s \in [1, \infty)$ it hold $\mu' \sim sX$. Note that from Remark 250 this holds e.g. if for some symmetric positive
semidefinite matrices $B$ and $B'$ we have $\mu = \widetilde{\chi}^2(B)$, $\mu' = \widetilde{\chi}^2(B')$, and $\mathrm{ord}(\mathrm{eig}(B')) = s\, \mathrm{ord}(\mathrm{eig}(B))$.
If further $r_t = t$, $t \in T = \mathbb{R}_+$, then $t(\phi(d_t) - y)$ and $t(\phi(d'_{st}) - y)$ both converge in distribution
to $\mu$, but computing $d'_{st}$ requires $s$ times higher budget than $d_t$, $t \in T$ (assuming that such
interpretation holds). Thus, the unprimed process can be called $s$ times (asymptotically) more
efficient for the minimization of $\phi$. Similarly, if $M' = sM$ for some nonzero covariance matrix
$M \in \mathbb{R}^{l \times l}$ and for some $\theta \in A$, $\sqrt{t}(d_t - \theta) \Rightarrow \mathcal{N}(0, M)$ and $\sqrt{t}(d'_t - \theta) \Rightarrow \mathcal{N}(0, M')$, then $d$ can be
said to converge $s$ times faster to $\theta$ than $d'$. In such a case, if additionally $\phi$ is twice differentiable
in $\theta$ with a zero gradient and a positive definite Hessian in this point, then from Remark 251,
for $B$ as in (8.13) we have $t(\phi(d_t) - y) \Rightarrow \widetilde{\chi}^2(B)$ and similarly for the primed process for $B' = sB$.
Thus, the unprimed process is $s$ times more efficient for the minimization of $\phi$.

Remark 254. Analogously as we have discussed certain properties of minimization results 
in Section 8.1 and their functions in Section 8.2, or proposed how to compare the asymptotic 
efficiency of stochastic minimization methods in Section 7.10 and this section, one can formulate 
such a theory for maximization methods. It is sufficient to notice that maximization of a 
function is equivalent to the minimization of its negative, so that it is sufficient to apply the 
above reasonings to the negatives of appropriate functions. 

8.4 Discussion of some conditions useful for proving the asymptotic 
properties in our methods 

Let us discuss when, under appropriate identifications given below, Condition 245 holds in 
the LETGS and ECM settings for the different SSM methods from the previous sections, i.e. 
for ESSM, GSSM, CGSSM, and MGSSM, and for such MSM methods, i.e. for EMSM, GMSM, 
CGMSM, and MGMSM. We consider in this condition $f$ equal to ce, msq, or ic, each defined
on $A = \mathbb{R}^l$. Furthermore, we take $T$ as for the minimization methods in the previous sections,
in particular for the MSM methods we take T = N+. For the EM and GM methods we take 
ft = ft and dt as in these methods, while for the CGM and MGM methods ft = ft and dt = dt, 
assuming Condition 196. 

Sufficient conditions for the smoothness of such functions $f$ follow from the discussion
in sections 6.1, 6.5, and 6.11. The smoothness of such $b \to f_t(b, \omega)$, $\omega \in \Omega$, $t \in T$, in the
LETGS setting is obvious and in the ECM setting it holds under Condition 36, which follows
from $A = \mathbb{R}^l$ as discussed in Remark 37. Sufficient assumptions for Condition 231, implying
Condition 230, to hold for $f$ equal to ce, in the ECM setting were provided in Section 6.1, and
in the LETGS setting, in Section 6.5. From the discussion in Section 6.9, for $A = \mathbb{R}^l$ as above,
Condition 231 follows from the strong convexity of $f$, sufficient assumptions for which for $f$
equal to msq or var in the LETGS setting were provided in Theorem 125. For $f$ equal to msq in
the ECM setting, sufficient conditions for it to have a unique minimum point were discussed
in sections 6.7 (see e.g. Lemma 106) and 5.1, and for $\nabla^2 f$ to be positive definite, in Theorem
124. For $f$ = ic, if $C$ is a positive constant, then Condition 231 follows from such a condition for
$f$ = var, and some other sufficient assumptions for Condition 231 to hold were discussed in
Remark 130. From the discussion in Section 7.3 we receive assumptions for which Condition
241, implying Condition 240, holds in the MSM methods as well as in the SSM methods for
$T = \mathbb{N}_+$ and $N_k = k$, and thus from Condition 162 also in the general case. For $d_t$ as above,
Condition 234 follows from Condition 167 and its counterparts.

Recall that from Lemma 248, conditions 242 and 247 are equivalent. For the EM methods,
if $b \to \widehat{\mathrm{est}}_n(b', b)(\omega)$ is differentiable, $b' \in A$, $\omega \in \Omega_1^n$, $n \in \mathbb{N}_p$, and if Condition 166 holds, then
Condition 247 holds in these methods even for $\delta_t = 0$, $t \in T$. For the GM methods, if Condition
166 holds and $\epsilon_t = o_p(r_t^{-1/2})$ (which holds e.g. if $\epsilon_t = r_t^{-q}$ for some $\frac{1}{2} < q < \infty$), then Condition
247 holds for $\delta_t = \epsilon_t$. In the CGM and MGM methods, if Condition 202 holds and $\epsilon_t = o_p(r_t^{-1/2})$,
then Condition 247 holds for $\delta_t = \epsilon_t$.


8.5 Asymptotic properties of single-stage minimization methods 

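Let $T \subset \mathbb{R}_+$ be unbounded, conditions 17, 18, 23, 32, and 162 hold, $A$ be open, $b^* \in A$, and
$b \in A \to L(b)(\omega)$ be differentiable, $\omega \in \Omega_1$. For some function $u : A \to [0, \infty]$ such that $u(b') \in \mathbb{R}_+$,
let us assume that

$$\frac{N_t}{t} \overset{p}{\to} \frac{1}{u(b')}, \quad (t \to \infty). \tag{8.19}$$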


Remark 255. Let $N_t$ be given by some $U$ as in Remark 163, and let $u(b) = \mathbb{E}_{\mathbb{Q}(b)}(U)$, $b \in A$. For
such an $U$ being the theoretical cost variable of a step of SSM as in Remark 163, $u(b')$ is such
a mean cost. If $u(b') \in \mathbb{R}_+$, then, as discussed in Chapter 2 (see (2.11)), we have a stronger fact
than (8.19), namely that a.s. $\frac{N_t}{t} \to \frac{1}{u(b')}$. For the special case of $U = 1$, we have $u(b) = 1$, $b \in A$.


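Below we shall prove that for $g$ substituted by ce, msq, msq2, or ic, under appropriate
assumptions, for some covariance matrix $\Sigma_g(b') \in \mathbb{R}^{l \times l}$ and

$$f_t(b) = \mathbb{1}(N_t = k \in \mathbb{N}_+)\, \widehat{g}_k(b', b)(\widetilde{\kappa}_k), \quad t \in T, \tag{8.20}$$

we have

$$\sqrt{t}\, \nabla f_t(b^*) \Rightarrow \mathcal{N}(0, u(b') \Sigma_g(b')). \tag{8.21}$$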

Remark 256. Let (8.21) hold for some $g$ as above and let Condition 245 hold for the
corresponding $f_t$ as above and $r_t = t$, $t \in T$, as well as for the minimized function $f$ equal to msq if
$g$ = msq2, and to $g$ otherwise. Then, from Theorem 237, denoting $H_f = \nabla^2 f(b^*)$ and

$$V_g(b') = H_f^{-1} \Sigma_g(b') H_f^{-1}, \tag{8.22}$$

we have

$$\sqrt{t}(d_t - b^*) \Rightarrow \mathcal{N}(0, u(b') V_g(b')). \tag{8.23}$$

Furthermore, from Remark 252, for

$$B_g(b') = \frac{1}{2} H_f^{-1/2} \Sigma_g(b') H_f^{-1/2}, \tag{8.24}$$

it holds

$$t(f(d_t) - f(b^*)) \Rightarrow \widetilde{\chi}^2(u(b') B_g(b')). \tag{8.25}$$

Let $u(b')$ be interpreted as the mean theoretical cost of a step of SSM as in Remark 255. Let us
consider different processes $d = (d_t)_{t \in T}$ from SSM methods for which (8.25) holds for possibly
different $b'$ and $g$, and whose SSM methods have the same proportionality constant $p_{\dot{U}}$ of
the theoretical to the practical cost variables of SSM steps (see Remark 163). Then, from the
discussion in Section 8.3, for $R \sim \widetilde{\chi}^2(u(b') B_g(b'))$, the second-order asymptotic efficiency of such
processes for the minimization of $f$ can be compared using the quantities

$$\mathbb{E}(R) = \frac{u(b')}{2} \mathrm{Tr}(\Sigma_g(b') H_f^{-1}). \tag{8.26}$$

For the SSM methods having different constants $p_{\dot{U}}$, one can compare such quantities multiplied
by such a $p_{\dot{U}}$.


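Let $p(b) = \frac{1}{L(b)}$, let us define the likelihood function $l(b) = \ln(p(b)) = -\ln L(b)$, and the score
function

$$S(b) = \nabla l(b) = \frac{\nabla p(b)}{p(b)} = -\frac{\nabla L(b)}{L(b)}, \tag{8.27}$$

$b \in A$, where such a terminology is used in maximum likelihood estimation; see [55]. Then,
$\widehat{\mathrm{ce}}_n(b', b) = -\overline{(Z L' l(b))}_n$ and $\nabla_b \widehat{\mathrm{ce}}_n(b', b) = -\overline{(Z L' S(b))}_n$. Thus, if

$$\mathbb{E}_{\mathbb{Q}'}((Z L' S_i(b^*))^2) = \mathbb{E}_{\mathbb{Q}_1}(L' (Z S_i(b^*))^2) < \infty, \quad i = 1, \ldots, l, \tag{8.28}$$

(for which to hold in the LETS setting, from Theorem 121 and Remark 119, it is sufficient if
Condition 115 holds for $S = Z^2$), and $\nabla \mathrm{ce}(b^*) = -\mathbb{E}_{\mathbb{Q}_1}(Z S(b^*)) = 0$, then, from Theorem 5, for

$$\Sigma_{\mathrm{ce}}(b') = \mathbb{E}_{\mathbb{Q}'}((Z L')^2 S(b^*) S(b^*)^T) = \mathbb{E}_{\mathbb{Q}_1}(L' Z^2 S(b^*) S(b^*)^T), \tag{8.29}$$

we have (8.21) for $g$ = ce (and $f_t$ as in (8.20) for such a $g$).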
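It holds

$$\nabla_b \widehat{\mathrm{msq}}_n(b', b) = \overline{(Z^2 L' \nabla L(b))}_n. \tag{8.30}$$

Thus, if

$$\mathbb{E}_{\mathbb{Q}'}((Z^2 L' \partial_i L(b^*))^2) = \mathbb{E}_{\mathbb{Q}_1}(Z^4 L' (\partial_i L(b^*))^2) < \infty, \quad i = 1, \ldots, l, \tag{8.31}$$

and

$$\nabla \mathrm{msq}(b^*) = \mathbb{E}_{\mathbb{Q}_1}(Z^2 \nabla L(b^*)) = 0, \tag{8.32}$$

then, from Theorem 5, for

$$\Sigma_{\mathrm{msq}}(b') = \mathbb{E}_{\mathbb{Q}_1}(L' Z^4 \nabla L(b^*) (\nabla L(b^*))^T), \tag{8.33}$$

we have (8.21) for $g$ = msq.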


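Let us further in this section assume Condition 22 and let

$$\widehat{1}_n(b', b) = \overline{\left(\frac{L'}{L(b)}\right)}_n, \quad n \in \mathbb{N}_+ \tag{8.34}$$

(see (4.26)). Consider now the case of $g$ = msq2. We have $\widehat{\mathrm{msq2}}_n = \widehat{\mathrm{msq}}_n \widehat{1}_n$ and thus

$$\nabla_b \widehat{\mathrm{msq2}}_n = (\nabla_b \widehat{\mathrm{msq}}_n) \widehat{1}_n + \widehat{\mathrm{msq}}_n \nabla_b \widehat{1}_n. \tag{8.35}$$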

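Let

$$T_t(b') = \sqrt{t}\, \mathbb{1}(N_t = k \in \mathbb{N}_+) (\nabla_b \widehat{\mathrm{msq}}_k(b', b^*) + \mathrm{msq}(b^*) \nabla_b \widehat{1}_k(b', b^*))(\widetilde{\kappa}_k)$$
$$= \sqrt{t}\, \mathbb{1}(N_t = k \in \mathbb{N}_+)\, \overline{(L' \nabla L(b^*) (Z^2 - \mathrm{msq}(b^*) L(b^*)^{-2}))}_k(\widetilde{\kappa}_k) \tag{8.36}$$

and

$$Z_t(b') = \sqrt{t}\, \mathbb{1}(N_t = k \in \mathbb{N}_+) \nabla_b \widehat{\mathrm{msq2}}_k(b', b^*)(\widetilde{\kappa}_k) - T_t(b')$$
$$= \sqrt{t}\, \mathbb{1}(N_t = k \in \mathbb{N}_+) (((\widehat{1}_k - 1) \nabla_b \widehat{\mathrm{msq}}_k)(b', b^*) + (\widehat{\mathrm{msq}}_k(b', b^*) - \mathrm{msq}(b^*)) \nabla_b \widehat{1}_k(b', b^*))(\widetilde{\kappa}_k). \tag{8.37}$$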
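Let

$$0 = \mathbb{E}_{\mathbb{Q}_1}(\nabla_b (L^{-1}(b^*))) \tag{8.38}$$

(see the first point of Theorem 123 for sufficient conditions for this in the LETS setting) and
$\mathbb{E}_{\mathbb{Q}_1}(L' L(b^*)^{-4} (\partial_i L(b^*))^2) < \infty$, $i = 1, \ldots, l$. Then, from Theorem 5,

$$\sqrt{t}\, \mathbb{1}(N_t = k \in \mathbb{N}_+) \nabla_b \widehat{1}_k(b', b^*)(\widetilde{\kappa}_k) \Rightarrow \mathcal{N}(0, u(b')\, \mathbb{E}_{\mathbb{Q}_1}(L' L(b^*)^{-4} \nabla L(b^*) (\nabla L(b^*))^T)). \tag{8.39}$$

Assuming further (8.31) and (8.32), from (8.21) for $g$ = msq, (4.26), (8.39), the fact that from
the SLLN and Condition 162, a.s. $\mathbb{1}(N_t = k \in \mathbb{N}_+)\, \widehat{\mathrm{msq}}_k(b', b^*)(\widetilde{\kappa}_k) \to \mathrm{msq}(b^*)$, as well as from
(8.37) and Slutsky's lemma,

$$Z_t(b') \overset{p}{\to} 0. \tag{8.40}$$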
Let

$$\Sigma_{\mathrm{msq2}}(b') = \mathbb{E}_{\mathbb{Q}_1}(L' (Z^2 - \mathrm{msq}(b^*) L(b^*)^{-2})^2 \nabla L(b^*) (\nabla L(b^*))^T). \tag{8.41}$$

From Theorem 5, $T_t(b') \Rightarrow \mathcal{N}(0, u(b') \Sigma_{\mathrm{msq2}}(b'))$, so that from (8.40), the first line of (8.37), and
Slutsky's lemma, we receive (8.21) for $g$ = msq2.
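To make the difference between the matrices for msq and msq2 concrete, here is a hedged one-dimensional sketch under assumptions that are not taken from the original text: $\mathbb{Q}_1 = \mathcal{N}(0, 1)$, $\mathbb{Q}(b) = \mathcal{N}(b, 1)$, so that $L(b)(x) = \exp(-bx + b^2/2)$, and $Z(x) = e^x$. In this toy model $\mathrm{msq}(b) = \exp(b^2 - 2b + 2)$ is minimized at $b^* = 1$, where $ZL(b^*)$ is constant, i.e. $b^*$ is a zero-variance IS parameter. The code evaluates the integrands of the expectations defining $\Sigma_{\mathrm{msq}}(b')$ and $\Sigma_{\mathrm{msq2}}(b')$ and integrates them against the $\mathcal{N}(0, 1)$ density with a trapezoid rule; all function names are hypothetical.

```python
import math

B_STAR = 1.0                # minimizer of msq in this toy model (assumption of the sketch)
MSQ_STAR = math.exp(1.0)    # msq(b*) = exp(b*^2 - 2 b* + 2) evaluated at b* = 1


def L(b, x):
    """Density ratio dQ_1/dQ(b) for Q_1 = N(0, 1) and Q(b) = N(b, 1)."""
    return math.exp(-b * x + 0.5 * b * b)


def grad_L(b, x):
    """Derivative of L(b)(x) with respect to b."""
    return (b - x) * L(b, x)


def Z(x):
    return math.exp(x)


def integrand_msq(x, b_prime):
    """Integrand L' Z^4 (grad L(b*))^2 of the expectation defining Sigma_msq(b')."""
    return L(b_prime, x) * Z(x) ** 4 * grad_L(B_STAR, x) ** 2


def integrand_msq2(x, b_prime):
    """Integrand L' (Z^2 - msq(b*) L(b*)^-2)^2 (grad L(b*))^2 defining Sigma_msq2(b')."""
    centred = Z(x) ** 2 - MSQ_STAR * L(B_STAR, x) ** (-2)
    return L(b_prime, x) * centred ** 2 * grad_L(B_STAR, x) ** 2


def expectation_under_q1(g, lo=-12.0, hi=12.0, steps=6000):
    """Trapezoid approximation of E(g(X)) for X ~ N(0, 1)."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * g(x) * math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return total * h
```

In this model the msq2 integrand vanishes identically (up to rounding), so $\Sigma_{\mathrm{msq2}}(b') = 0$, while the msq integrand is positive away from $x = 1$; for $b' = 1$ a direct Gaussian computation gives $\Sigma_{\mathrm{msq}}(1) = e^2$, which `expectation_under_q1(lambda x: integrand_msq(x, 1.0))` should reproduce closely.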

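Let us finally consider the case of $g$ = ic. We have for $n \in \mathbb{N}_2$

$$\nabla_b \widehat{\mathrm{ic}}_n = (\nabla_b \widehat{c}_n) \widehat{\mathrm{var}}_n + \widehat{c}_n \nabla_b \widehat{\mathrm{var}}_n, \tag{8.42}$$

where

$$\nabla_b \widehat{\mathrm{var}}_n = \frac{n}{n-1} ((\nabla_b \widehat{\mathrm{msq}}_n(b', b)) \widehat{1}_n + \widehat{\mathrm{msq}}_n(b', b) \nabla_b \widehat{1}_n). \tag{8.43}$$

Let for $D = (\mathbb{R} \times \mathbb{R}^l)^3 \times \mathbb{R}$ and $n \in \mathbb{N}_+$, $U_n(b') : \Omega_1^n \to D$ be equal to

$$(\widehat{c}_n(b', b^*), \nabla_b \widehat{c}_n(b', b^*), \widehat{\mathrm{msq}}_n(b', b^*), \nabla_b \widehat{\mathrm{msq}}_n(b', b^*), \widehat{1}_n(b', b^*), \nabla_b \widehat{1}_n(b', b^*), \overline{(Z L')}_n) \tag{8.44}$$

and let

$$\theta := (c(b^*), \nabla c(b^*), \mathrm{msq}(b^*), \nabla \mathrm{msq}(b^*), 1, 0, \alpha) = \mathbb{E}_{\mathbb{Q}'}(U_1(b')) \in D. \tag{8.45}$$

Let the coordinates of $U_1(b')$ be square-integrable under $\mathbb{Q}'$ and

$$\Psi := \mathbb{E}_{\mathbb{Q}'}((U_1(b') - \theta)(U_1(b') - \theta)^T). \tag{8.46}$$

Then, from Theorem 5,

$$\sqrt{t}\, \mathbb{1}(N_t = k \in \mathbb{N}_+)(U_k(b')(\widetilde{\kappa}_k) - \theta) \Rightarrow \mathcal{N}(0, u(b') \Psi). \tag{8.47}$$

For $\phi : D \to \mathbb{R}^l$ such that

$$\phi((x_i)_{i=1}^7) = x_2 (x_3 x_5 - x_7^2) + x_1 (x_4 x_5 + x_3 x_6), \tag{8.48}$$

we have for $n \in \mathbb{N}_2$

$$\nabla_b \widehat{\mathrm{ic}}_n(b', b^*) = \frac{n}{n-1} \phi(U_n(b')) \tag{8.49}$$

and $\nabla \mathrm{ic}(b^*) = \phi(\theta)$. Let us assume that

$$\nabla \mathrm{ic}(b^*) = \phi(\theta) = 0. \tag{8.50}$$

Using the delta method from Theorem 249, as well as Remark 251 and (8.47), we receive that
for

$$\Sigma_{\mathrm{ic}}(b') = \phi'(\theta) \Psi (\phi'(\theta))^T \tag{8.51}$$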

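we have

$$\sqrt{t}\, \mathbb{1}(N_t = k \in \mathbb{N}_+)\, \phi(U_k(b')(\widetilde{\kappa}_k)) \Rightarrow \mathcal{N}(0, u(b') \Sigma_{\mathrm{ic}}(b')). \tag{8.52}$$

From $\lim_{n \to \infty} \frac{n}{n-1} = 1$, (8.49), (8.52), and Slutsky's lemma, we thus have (8.21). From (8.46) and
(8.51), for $W(b') := \phi'(\theta)(U_1(b') - \theta)$

$$\Sigma_{\mathrm{ic}}(b') = \mathbb{E}_{\mathbb{Q}'}(W(b') W(b')^T). \tag{8.53}$$

We have

$$\phi'(\theta)((x_i)_{i=1}^7) = \nabla \mathrm{msq}(b^*) x_1 + \mathrm{var}(b^*) x_2 + \nabla c(b^*) x_3$$
$$+\ c(b^*) x_4 + (\nabla c(b^*) \mathrm{msq}(b^*) + c(b^*) \nabla \mathrm{msq}(b^*)) x_5 \tag{8.54}$$
$$+\ c(b^*) \mathrm{msq}(b^*) x_6 - 2 \nabla c(b^*) \alpha x_7,$$

so that

$$W(b') = \nabla \mathrm{msq}(b^*)(C L' L(b^*)^{-1} - c(b^*)) + \mathrm{var}(b^*)(-C L' L^{-2}(b^*) \nabla L(b^*) - \nabla c(b^*))$$
$$+\ \nabla c(b^*)(Z^2 L' L(b^*) - \mathrm{msq}(b^*)) + c(b^*)(Z^2 L' \nabla L(b^*) - \nabla \mathrm{msq}(b^*))$$
$$+\ (\nabla c(b^*) \mathrm{msq}(b^*) + c(b^*) \nabla \mathrm{msq}(b^*))(L' L^{-1}(b^*) - 1) \tag{8.55}$$
$$-\ c(b^*) \mathrm{msq}(b^*) L' L(b^*)^{-2} \nabla L(b^*) - 2 \nabla c(b^*) \alpha (Z L' - \alpha).$$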

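Remark 257. Let us make assumptions as above and that $C = 1$. Then, we have $c(b) = 1$,
$\nabla c(b) = 0$, and $\mathrm{ic}(b) = \mathrm{var}(b)$, $b \in A$, and from (8.50), $\nabla \mathrm{msq}(b^*) = \nabla \mathrm{var}(b^*) = 0$, so that from
(8.55),

$$W(b') = L' \nabla L(b^*)(Z^2 - (\mathrm{var}(b^*) + \mathrm{msq}(b^*)) L(b^*)^{-2}), \tag{8.56}$$

and thus

$$\Sigma_{\mathrm{ic}}(b') = \mathbb{E}_{\mathbb{Q}_1}(L' (Z^2 - (\mathrm{var}(b^*) + \mathrm{msq}(b^*)) L(b^*)^{-2})^2 \nabla L(b^*) (\nabla L(b^*))^T). \tag{8.57}$$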

8.6 A helper CLT 

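For some $l \in \mathbb{N}_+$, consider a nonempty set $A \in \mathcal{B}(\mathbb{R}^l)$ and a corresponding family of
probability distributions $\mathbb{Q}(b)$, $b \in A$, as in Section 3.4. Let $m \in \mathbb{N}_+$, $u : \mathcal{S}(A) \otimes \mathcal{S}_1 \to \mathcal{S}(\mathbb{R})$, and $B \in \mathcal{B}(A)$ be
nonempty.

Condition 258. For

$$f(b, M) := \mathbb{E}_{\mathbb{Q}(b)}(|u(b, \cdot)|\, \mathbb{1}(|u(b, \cdot)| > M)), \quad b \in B,\ M \in \mathbb{R}, \tag{8.58}$$

and $R(M) := \sup_{b \in B} f(b, M)$, $M \in \mathbb{R}$, we have

$$\lim_{M \to \infty} R(M) = 0. \tag{8.59}$$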


Note that the above condition is equivalent to saying that for random variables $\psi_b \sim \mathbb{Q}(b)$,
$b \in B$, the family $\{|u(b, \psi_b)| : b \in B\}$ is uniformly integrable. In particular, similarly as for
uniform integrability, using Hölder's inequality one can prove the following criterion for the
above condition to hold.

Lemma 259. If for some $p > 1$,

$$\sup_{b \in B} \mathbb{E}_{\mathbb{Q}(b)}(|u(b, \cdot)|^p) < \infty, \tag{8.60}$$

then Condition 258 holds.

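For some $r \in \mathbb{N}_+ \cup \{\infty\}$, $\mathbb{R}$-valued random variables $(\psi_i)_{i=1}^r$ are said to be martingale differences
for a filtration $(\mathcal{F}_i)_{i=0}^r$ if $M_n = \sum_{i=1}^n \psi_i$, $n \in \mathbb{N}$, $n \le r$, is a martingale for $(\mathcal{F}_i)_{i=0}^r$, that is if $\mathbb{E}(|\psi_i|) < \infty$,
$\psi_i$ is $\mathcal{F}_i$-measurable, and $\mathbb{E}(\psi_i | \mathcal{F}_{i-1}) = 0$, $i = 1, \ldots, r$. The following martingale CLT is a special
case of Theorem 8.2 with conditions II, page 442 in [39].

Theorem 260. For each $n \in \mathbb{N}_+$, let $m_n \in \mathbb{N}_+$, $(\mathcal{F}_{n,k})_{k=0}^{m_n}$ be a filtration, and $(\psi_{n,k})_{k=1}^{m_n}$ be
martingale differences for it such that $\mathbb{E}(\psi_{n,k}^2) < \infty$, $k = 1, \ldots, m_n$. Let further

1. for each $\delta > 0$, $\sum_{k=1}^{m_n} \mathbb{E}(\psi_{n,k}^2 \mathbb{1}(|\psi_{n,k}| > \delta) | \mathcal{F}_{n,k-1}) \overset{p}{\to} 0$ (as $n \to \infty$),

2. for some $\sigma \in [0, \infty)$, $\sum_{k=1}^{m_n} \mathbb{E}(\psi_{n,k}^2 | \mathcal{F}_{n,k-1}) \overset{p}{\to} \sigma^2$.

Then,

$$\sum_{k=1}^{m_n} \psi_{n,k} \Rightarrow \mathcal{N}(0, \sigma^2). \tag{8.61}$$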

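Let $r : \mathcal{S}(A) \otimes \mathcal{S}_1 \to \mathcal{S}(\mathbb{R}^m)$ be such that for each $b \in A$,

$$\mathbb{E}_{\mathbb{Q}(b)}(r(b, \cdot)) = 0. \tag{8.62}$$

Consider a matrix

$$\Sigma(b) = \mathbb{E}_{\mathbb{Q}(b)}(r(b, \cdot) r(b, \cdot)^T) \tag{8.63}$$

for $b \in A$ for which it is well-defined. Note that under Condition 258 for $u = |r|^2$, from (8.59)
we have $R(M) < \infty$ for some $M > 0$, and thus $R(0) \le M + R(M) < \infty$ and $\Sigma(b) \in \mathbb{R}^{m \times m}$, $b \in B$.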

Theorem 261. Let us assume that Condition 135 holds and we have

$$\lim_{k \to \infty} n_k = \infty. \tag{8.64}$$

Let further $B$ as above be a neighbourhood of $b^* \in A$,

$$b_n \overset{p}{\to} b^*, \tag{8.65}$$

Condition 258 hold for $u = |r|^2$, and $b \in B \to \Sigma(b)$ be continuous in $b^*$. Then,

$$\frac{1}{\sqrt{n_k}} \sum_{i=1}^{n_k} r(b_{k-1}, \chi_{k,i}) \Rightarrow \mathcal{N}(0, \Sigma(b^*)). \tag{8.66}$$
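Proof. Let $W_k = \frac{1}{\sqrt{n_k}} \sum_{i=1}^{n_k} r(b_{k-1}, \chi_{k,i})$ and $D_k = \mathbb{1}(b_{k-1} \in B) W_k$, $k \in \mathbb{N}_+$. From (8.65) we have
$W_k - D_k = \mathbb{1}(b_{k-1} \notin B) W_k \overset{p}{\to} 0$, so that from Slutsky's lemma it is sufficient to prove that
$D_k \Rightarrow \mathcal{N}(0, \Sigma(b^*))$. Furthermore, using Cramér-Wold device it is sufficient to prove that for
each $t \in \mathbb{R}^m$, for $v(b) := t^T \Sigma(b) t$, $b \in B$, and $S_k := t^T D_k$, we have $S_k \Rightarrow \mathcal{N}(0, v(b^*))$. For $t = 0$
this is obvious, so let us consider $t \ne 0$. It is sufficient to check that the assumptions of Theorem
260 hold for $m_k = n_k$, $\mathcal{F}_{k,i} = \sigma(b_{k-1}; \chi_{k,j} : j \le i)$, $\psi_{k,i} = \frac{1}{\sqrt{n_k}} \mathbb{1}(b_{k-1} \in B)\, t^T r(b_{k-1}, \chi_{k,i})$, and
$\sigma^2 = v(b^*)$. From Condition 258 (for $u = |r|^2$),

$$\mathbb{E}(\psi_{k,i}^2) \le \frac{|t|^2}{n_k} \mathbb{E}((\mathbb{E}_{\mathbb{Q}(b)}(\mathbb{1}(b \in B) |r(b, \cdot)|^2))_{b = b_{k-1}})$$
$$\le \frac{|t|^2}{n_k} \mathbb{E}(\mathbb{1}(b_{k-1} \in B) f(b_{k-1}, 0)) \tag{8.67}$$
$$\le \frac{|t|^2}{n_k} R(0) < \infty.$$

For $\delta > 0$, from Condition 258 and (8.64),

$$\sum_{i=1}^{m_k} \mathbb{E}(\psi_{k,i}^2 \mathbb{1}(|\psi_{k,i}| > \delta) | \mathcal{F}_{k,i-1}) = (\mathbb{1}(b \in B)\, \mathbb{E}_{\mathbb{Q}(b)}(|t^T r(b, \cdot)|^2 \mathbb{1}(|t^T r(b, \cdot)| > \sqrt{n_k} \delta)))_{b = b_{k-1}}$$
$$\le |t|^2 (\mathbb{1}(b \in B)\, \mathbb{E}_{\mathbb{Q}(b)}(|r(b, \cdot)|^2 \mathbb{1}(|t| |r(b, \cdot)| > \sqrt{n_k} \delta)))_{b = b_{k-1}}$$
$$= |t|^2 \mathbb{1}(b_{k-1} \in B) f\left(b_{k-1}, n_k \left(\frac{\delta}{|t|}\right)^2\right) \tag{8.68}$$
$$\le |t|^2 R\left(n_k \left(\frac{\delta}{|t|}\right)^2\right) \to 0, \quad k \to \infty.$$

To prove the second point of Theorem 260 let us notice that

$$\sum_{i=1}^{m_k} \mathbb{E}(\psi_{k,i}^2 | \mathcal{F}_{k,i-1}) = (\mathbb{1}(b \in B)\, \mathbb{E}_{\mathbb{Q}(b)}(|t^T r(b, \cdot)|^2))_{b = b_{k-1}}$$
$$= (\mathbb{1}(b \in B) v(b))_{b = b_{k-1}} \overset{p}{\to} v(b^*), \tag{8.69}$$

where in the last line we used (8.65) and the continuity of $b \to \mathbb{1}(b \in B) v(b)$ in $b^*$. □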

We will be most interested in the IS case in which we shall assume the following condition.

Condition 262. For some $\mathbb{R}^m$-valued random variable $Y$ on $\mathcal{S}_1$, Condition 17 holds for $Z$
replaced by $Y$, and we have

$$r(b, \omega) = (Y L(b))(\omega), \quad b \in A,\ \omega \in \Omega_1. \tag{8.70}$$
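Lemma 263. Let us assume Condition 262 and that for $F = \mathbb{1}(Y \ne 0) \sup_{b \in B} L(b)$ we have

$$\mathbb{E}_{\mathbb{Q}_1}(|Y|^2 F) < \infty. \tag{8.71}$$

Then, Condition 258 holds for $u = |r|^2$. If further $\mathbb{Q}_1$ a.s. $b \to L(b)$ is continuous, then $b \in B \to \Sigma(b)$
is continuous.

Proof. For each $M \in \mathbb{R}$ and $b \in B$

$$f(b, M) = \mathbb{E}_{\mathbb{Q}(b)}(|Y L(b)|^2 \mathbb{1}(|Y L(b)| > \sqrt{M}))$$
$$= \mathbb{E}_{\mathbb{Q}_1}(|Y|^2 L(b) \mathbb{1}(|Y L(b)| > \sqrt{M})) \tag{8.72}$$
$$\le \mathbb{E}_{\mathbb{Q}_1}(|Y|^2 F \mathbb{1}(|Y F| > \sqrt{M})),$$

so that

$$R(M) \le \mathbb{E}_{\mathbb{Q}_1}(|Y|^2 F \mathbb{1}(|Y F| > \sqrt{M})) \le \mathbb{E}_{\mathbb{Q}_1}(|Y|^2 F) < \infty. \tag{8.73}$$

From (8.71) we have $0 = \mathbb{Q}_1(|Y|^2 F = \infty) = \mathbb{Q}_1(|Y| F = \infty)$. Therefore, as $M \to \infty$, we have
$\mathbb{1}(|Y F| > \sqrt{M}) \to 0$ and thus from Lebesgue's dominated convergence theorem also
$\mathbb{E}_{\mathbb{Q}_1}(|Y|^2 F \mathbb{1}(|Y F| > \sqrt{M})) \to 0$ and $R(M) \to 0$, i.e. Condition 258 holds. We have

$$\Sigma_{i,j}(b) = \mathbb{E}_{\mathbb{Q}(b)}(Y_i Y_j L(b)^2) = \mathbb{E}_{\mathbb{Q}_1}(Y_i Y_j L(b)) \tag{8.74}$$

and $|Y_i Y_j L(b)| \le |Y|^2 F$, $b \in B$. Thus, the continuity of $b \in B \to \Sigma_{i,j}(b)$ (and thus also of
$b \in B \to \Sigma(b)$) follows from Lebesgue's dominated convergence theorem. □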

Remark 264. From Theorem 121 and Remark 119, in the LETS setting, for $B$ bounded, under
Condition 262, and for $F$ as in Lemma 263, (8.71) holds if Condition 115 holds for $S = |Y|^2$.

8.7 Asymptotic properties of multi-stage minimization methods 

Consider the following conditions. 

Condition 265. We have $d^* \in A \in \mathcal{B}(\mathbb{R}^l)$ and $A$-valued random variables $b_k$, $k \in \mathbb{N}$, are such
that

$$b_k \overset{p}{\to} d^*. \tag{8.75}$$

Remark 266. From Remark 170 and its counterparts for CGMSM and MGMSM as discussed
in Remark 204, under conditions 165 and 167 for EMSM and GMSM or their counterparts for
CGMSM and MGMSM as discussed above Remark 204, as well as under Condition 169, we have a.s.
for a sufficiently large $k$, $b_k = d_k$ for EMSM and GMSM, and $b_k = \widetilde{d}_k$ for CGMSM and MGMSM.
In particular, a.s. $\lim_{k \to \infty} b_k = b^*$ and thus Condition 265 holds for $d^* = b^*$.

Let us further in this section assume conditions 135 and 265. Using analogous reasonings and
assumptions as in Section 8.5, but using CLT from Theorem 261 for $b^*$ replaced by $d^*$ instead
of Theorem 5, we receive that under the appropriate assumptions as in that theorem we have
for $g$ in the below formulas substituted by ce, msq, or ic, that for $f_k(b) = \widehat{g}_{n_k}(b_{k-1}, b)(\widetilde{\chi}_k)$

$$\sqrt{n_k}\, \nabla f_k(b^*) \Rightarrow \mathcal{N}(0, \Sigma_g(d^*)). \tag{8.76}$$

To prove (8.76) for $g$ = msq2 using a reasoning analogous as in Section 8.5 we additionally
need the facts that $\widehat{\mathrm{msq}}_{n_k}(b_{k-1}, b^*)(\widetilde{\chi}_k) \overset{p}{\to} \mathrm{msq}(b^*)$ and $\widehat{1}_{n_k}(b_{k-1}, b^*)(\widetilde{\chi}_k) \overset{p}{\to} 1$. Under
appropriate assumptions, such convergence results follow from the convergence in distribution
of $\sqrt{n_k}(\widehat{\mathrm{msq}}_{n_k}(b_{k-1}, b^*)(\widetilde{\chi}_k) - \mathrm{msq}(b^*))$ and $\sqrt{n_k}(\widehat{1}_{n_k}(b_{k-1}, b^*)(\widetilde{\chi}_k) - 1)$, which can be proved
using Theorem 261 as above. For different $g$ as above, assuming (8.76) and that Condition 245
holds for $T = \mathbb{N}_+$, $r_k = n_k$, $f_k$ as above, and $f$ corresponding to $g$ as in Remark 256 (see Section
8.4 for some sufficient assumptions), for $V_g$ and $B_g$ as in that remark we have from Theorem
237

$$\sqrt{n_k}(d_k - b^*) \Rightarrow \mathcal{N}(0, V_g(d^*)), \tag{8.77}$$

and from Remark 252

$$n_k(f(d_k) - f(b^*)) \Rightarrow \widetilde{\chi}^2(B_g(d^*)). \tag{8.78}$$
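If for $s_k$ denoting the number of samples generated till the $k$th stage of MSM, i.e. $s_k := \sum_{i=1}^k n_i$,
$k \in \mathbb{N}_+$, we have

$$\lim_{k \to \infty} \frac{s_k}{n_k} = \gamma \in [1, \infty), \tag{8.79}$$

then from (8.77) it follows that

$$\sqrt{s_k}(d_k - b^*) \Rightarrow \mathcal{N}(0, \gamma V_g(d^*)), \tag{8.80}$$

while from (8.78) it follows that

$$s_k(f(d_k) - f(b^*)) \Rightarrow \widetilde{\chi}^2(\gamma B_g(d^*)). \tag{8.81}$$

For instance, for $n_k = A_1 + A_2 m^k$ as in Remark 149, we have $s_k = A_1 k + A_2 \frac{m^{k+1} - m}{m - 1}$, so that
$\gamma = \frac{m}{m-1}$. For $n_k = A_1 + A_2 k!$ as in that remark, we have

$$\frac{s_k}{n_k} \le \frac{s_k}{A_2 k!} = \frac{A_1}{A_2 (k-1)!} + 1 + \frac{1}{k} + \frac{1}{k(k-1)} + \ldots + \frac{1}{k!} \le \frac{A_1}{A_2 (k-1)!} + 1 + \frac{2}{k}, \tag{8.82}$$

so that $\gamma = 1$.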
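The limit $\gamma$ in (8.79) can be explored numerically for a given schedule of stage sizes. The sketch below is an illustration under stated assumptions: it takes a geometric schedule of the form $n_k = A_1 + A_2 m^k$ with $m > 1$ and a factorial schedule $n_k = A_1 + A_2 k!$, with hypothetical constants, and computes the ratio $s_k / n_k$ of the cumulative number of samples to the size of the last stage.

```python
import math


def cumulative_to_last_ratio(stage_size, k):
    """Return s_k / n_k, where s_k = n_1 + ... + n_k and n_i = stage_size(i)."""
    sizes = [stage_size(i) for i in range(1, k + 1)]
    return sum(sizes) / sizes[-1]


def geometric_schedule(a1, a2, m):
    """Stage sizes n_i = a1 + a2 * m**i (assumed form of a geometric schedule)."""
    return lambda i: a1 + a2 * m ** i


def factorial_schedule(a1, a2):
    """Stage sizes n_i = a1 + a2 * i!."""
    return lambda i: a1 + a2 * math.factorial(i)
```

With the geometric schedule the ratio approaches $m / (m - 1)$ as $k$ grows, e.g. `cumulative_to_last_ratio(geometric_schedule(10, 5, 2), 60)` is close to 2, while with the factorial schedule it approaches 1, so that in the sense of Remark 267 the multi-stage process then loses asymptotically nothing relative to a single stage using all samples.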

Remark 267. Let us assume that (8.81) holds and let $p_{s_k} = d_k$, $k \in \mathbb{N}_+$, i.e. $p$ is the process of
MSM results but indexed by the total number of the generated samples rather than the number
of stages. Consider now $d_k$ as in SSM in Section 8.5 for $T = \mathbb{N}_+$, $N_k = k$, $k \in T$, and $b' = d^*$,
so that we have (8.19) for $u(b') = u(d^*) = 1$. Let us assume that (8.25) holds for such $d_k$, and
let $p'_{s_k} = d_{s_k}$, $k \in \mathbb{N}_+$. Let further $T = \{s_k : k \in \mathbb{N}_+\}$. Then, $t(f(p_t) - f(b^*)) \Rightarrow \widetilde{\chi}^2(\gamma B_g(d^*))$ and
$t(f(p'_t) - f(b^*)) \Rightarrow \widetilde{\chi}^2(B_g(d^*))$, $t \to \infty$, $t \in T$. Thus, for $\gamma = 1$ or $B_g(d^*) = 0$, the processes $p$ and
$p'$ can be considered asymptotically equally efficient for the minimization of $f$ in the second-order
sense as discussed in Section 8.3, while for $\gamma > 1$ and $B_g(d^*) \ne 0$ the process from SSM can
be considered more efficient than the one from MSM in such sense.

Remark 268. From the discussion in Section 8.3 for $T = \mathbb{N}_+$ and $r_k = n_k$, $k \in T$, for
$R \sim \widetilde{\chi}^2(B_g(d^*))$, the second-order asymptotic inefficiency for the minimization of $f$ of processes
$d = (d_k)_{k \in \mathbb{N}_+}$ from MSM satisfying (8.78) like above, e.g. for different $g$ or $d^*$ but for the same $b^*$
and $n_k$, can be quantified using

$$\mathbb{E}(R) = \frac{1}{2} \mathrm{Tr}(\Sigma_g(d^*) H_f^{-1}). \tag{8.83}$$

Let (8.81) hold and consider a process $p_{s_k} = d_k$, $k \in \mathbb{N}_+$, as in Remark 267. Then, the asymptotic
inefficiency of $p$ for the minimization of $f$ can be quantified using

$$\frac{\gamma}{2} \mathrm{Tr}(\Sigma_g(d^*) H_f^{-1}). \tag{8.84}$$

Using (8.84) one can compare the asymptotic efficiency of such a process $p$ from MSM with that
of a process $p'$ from SSM as in Remark 267, but this time without assuming that $b' = d^*$, so
that the inefficiency of $p'$ is quantified by $\frac{1}{2} \mathrm{Tr}(\Sigma_g(b') H_f^{-1})$. In particular, if $\gamma \mathrm{Tr}(\Sigma_g(d^*) H_f^{-1}) <
\mathrm{Tr}(\Sigma_g(b') H_f^{-1})$ then $p$ can be considered asymptotically more efficient for the minimization of $f$
than $p'$.

Consider further the mean theoretical cost $u$ of MSM steps, analogous as in Remark 255 for
SSM. For two MSM processes $d$ as above for which $u$ is continuous in the corresponding points
$d^*$, (8.83) is positive and not higher for the first process than for the second one, and $u(d^*)$
is lower for the first process than for the second one if the constants $p_{\dot{U}}$ as in Remark 163 for
these processes are the same, or the mean practical cost $p_{\dot{U}} u(d^*)$ of this process is lower if these
constants are different, it seems reasonable to consider the first process asymptotically more
efficient for the minimization of $f$. More generally, by analogy to formula (8.26) for SSM, rather
than using (8.83), one can quantify the asymptotic inefficiency of MSM processes $d$ as above by

$$u(d^*) \frac{1}{2} \mathrm{Tr}(\Sigma_g(d^*) H_f^{-1}), \tag{8.85}$$

or such a quantity multiplied by $p_{\dot{U}}$ respectively.

A more desirable possibility than having Condition 265 satisfied for $d^* = b^*$ as discussed in
Remark 266 (where for the minimization methods from the previous sections such a $b^*$ is
equal to the unique minimum point of the minimized function $f$), may be to have it fulfilled
for $d^*$ minimizing some measure of the asymptotic inefficiency of $d_k$ for the minimization of
$f$, like (8.83) or (8.85) (assuming that such a $d^*$ exists). See Chapter 11 for further discussion
of this idea. From Remark 278 in Section 8.9 it will follow that for $g$ = msq, the minimum point
of msq does not need to be the minimum of (8.83) in the function of $d^*$.
8.8 Asymptotic properties of the minimization results of the new estimators when a zero- or optimal-variance IS parameter exists

Condition 269. We have the assumptions as in Section 7.11 above Condition 222, D as in that 
section is a neighbourhood ofb*, conditions 216 and 222 hold, and for each b’ e A, ke Np, and 
(I) e Qj, b — ^tjcib', b) (w) is differentiable in b*. 

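Theorem 270. Let conditions 32 and 269 hold, $T \subset \mathbb{R}_+$ be unbounded, $N_t$, $t \in T$, be $\mathbb{N} \cup \{\infty\}$-valued 
random variables, and $b \in B \to f_t(b) := \mathbb{1}(N_t = k \in \mathbb{N}_p)\,\widehat{\mathrm{est}}_k(b', b)(\widetilde{\kappa}_k)$, $t \in T$. Then, it 
holds a.s. 

$$\nabla f_t(b^*) = 0, \quad t \in T. \tag{8.86}$$

If further Condition 245 holds for such $f_t$, then 

$$\sqrt{r_t}\,(d_t - b^*) \overset{p}{\to} 0, \quad t \to \infty. \tag{8.87}$$

Proof. From Lemma 224 we receive (8.86). Thus, (8.87) follows from Theorem 246. □ 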

Theorem 271. Let Condition 135 hold for $n_k \in \mathbb{N}_p$, $k \in \mathbb{N}_+$, let Condition 269 hold, and let 
$b \in B \to f_k(b) := \widehat{\mathrm{est}}_{n_k}(b_{k-1}, b)(\widetilde{\chi}_k)$, $k \in T := \mathbb{N}_+$. Then, we have a.s. (8.86). If further Condition 
245 holds for such $f_k$, then (8.87) holds. 

Proof. From Lemma 227 we have (8.86), so that (8.87) follows from Theorem 246. □ 

Remark 272. Let conditions 18, 22, and 23 hold, A be a neighbourhood of $b^*$, and $b \to L(b)(\omega)$ 
be differentiable, $\omega \in \Omega_1$. Let $\widehat{\mathrm{est}}_n$ be equal to $\widehat{\mathrm{msq2}}_n$ and $b^*$ be an optimal-variance IS 
parameter, or $\widehat{\mathrm{est}}_n$ be equal to $\widehat{\mathrm{ic}}_n$ and $b^*$ be a zero-variance IS parameter. Let further for SSM 
Condition 32 hold and T and $N_t$, $t \in T$, be as in Theorem 270, while for MSM let Condition 
135 hold for $n_k \in \mathbb{N}_p$, $k \in T := \mathbb{N}_+$, for p = 1 for $\widehat{\mathrm{est}}_n = \widehat{\mathrm{msq2}}_n$ or p = 2 for $\widehat{\mathrm{est}}_n = \widehat{\mathrm{ic}}_n$. Then, 
from remarks 219, 223, and the above theorems, we have a.s. (8.86) for $f_t$ as in Theorem 270 
for SSM, and as in Theorem 271 for MSM. If further Condition 245 holds (see Section 8.4 for 
some sufficient assumptions), then we also have (8.87) in these methods. If we have (8.87) for 
$r_t$ growing to infinity faster than t for SSM or than $n_t$ for MSM, i.e. such that $\lim_{t\to\infty} t/r_t = 0$ or 
$\lim_{t\to\infty} n_t/r_t = 0$ respectively, then we have in a sense a faster rate of convergence of $d_t$ to $b^*$ than in 
Section 8.5 for SSM or in Section 8.7 for MSM respectively. 

8.9 Some properties of the matrices characterizing the asymptotic distributions when a zero- or optimal-variance IS parameter exists 

Let us further in this section assume conditions 22 and 23. Consider matrix-valued functions 
$\Sigma_g$, $V_g$, and $B_g$, given by the formulas from Section 8.5 and considered on the subsets of A on 
which these formulas make sense. 

From the reasonings in Section 8.5, for each $b' \in A$, under Condition 32, for g replaced by msq2 
or ic, for $f_n(b) = \widehat{g}_n(b', b)(\widetilde{\kappa}_n)$, $n \in \mathbb{N}_+$, under appropriate assumptions $\Sigma_g(b')$ is the asymptotic 
covariance matrix of $\sqrt{n}\,\nabla f_n(b^*)$. Under appropriate assumptions as in Remark 272, including 
$b^*$ being an optimal-variance IS parameter in the case of g = msq2 or a zero-variance one 
for g = ic, we have a.s. $\nabla f_k(b^*) = 0$, $k \in \mathbb{N}_+$. Thus, in such cases $\Sigma_g(b') = 0$, $b' \in A$. This can 
also be verified by the following more generally valid direct calculations. For $b^*$ being an 
optimal-variance IS parameter, from (3.12), $\mathbb{Q}_1$ a.s. (and thus from Condition 22 also $\mathbb{Q}(b)$ a.s., 
$b \in A$) 

$$(ZL(b^*))^2 = \operatorname{msq}(b^*), \tag{8.88}$$

and thus from (8.41), $\Sigma_{\mathrm{msq2}} = 0$. Let now $b^*$ be a zero-variance IS parameter. Then, we have 
$\operatorname{var}(b^*) = 0$ and under appropriate differentiability assumptions $\nabla \operatorname{msq}(b^*) = 0$. Using further 
(8.88) and the fact that $\mathbb{Q}_1$ a.s. $ZL(b^*) = \alpha$, from (8.55) we receive that for each $b' \in A$, Q' a.s. 

Wib') = yc[b*)[Z^L'Lib*)-msqib*) + msq(h*)(l'l“^(h*) - 1) -2a(Zl' - a)) 

* * , ? * * ? (8.89) 

-H cib*)yLib*)L'iZ^-msqib*)L[b*)~n = 0. 

Thus, from (8.53), $\Sigma_{\mathrm{ic}} = 0$. Using the notations as in Remark 256, if $H_{\mathrm{msq}}$ is positive definite for 
g = msq2 or $H_{\mathrm{ic}}$ is positive definite for g = ic, and we have $\Sigma_g = 0$ as above, then it also holds 

$$V_g = B_g = 0.$$

Let us further use the notations $p(b)$, $l(b)$, and $S(b)$ as in Section 8.5. We define the Fisher 
information matrix as 

$$I(b) = \mathbb{E}_{\mathbb{Q}(b)}\big(S(b)S(b)^T\big), \quad b \in A \tag{8.90}$$

(assuming that it is well defined). It is well known that under appropriate assumptions, 
allowing one to move the derivatives inside the expectations in the derivation below, we have 

$$I(b) = -\mathbb{E}_{\mathbb{Q}(b)}\big(\nabla^2 l(b)\big). \tag{8.91}$$

The following derivation is as on page 63 in [55]. From $\mathbb{E}_{\mathbb{Q}_1}(p(b)) = 1$, $b \in A$, we have 
$\mathbb{E}_{\mathbb{Q}_1}(\nabla p(b)) = 0$, $b \in A$, and $0 = \mathbb{E}_{\mathbb{Q}_1}(\nabla^2 p(b)) = \mathbb{E}_{\mathbb{Q}(b)}\big(\frac{\nabla^2 p(b)}{p(b)}\big)$, $b \in A$, so that taking the expectation with respect 
to $\mathbb{Q}(b)$ of 

$$\nabla^2 l(b) = \frac{\nabla^2 p(b)}{p(b)} - \frac{\nabla p(b)(\nabla p(b))^T}{p^2(b)} \tag{8.92}$$

we receive (8.91). 

Let us define 

$$R(a,b) = \mathbb{E}_{\mathbb{Q}(a)}\Big(\frac{L(b)}{L(a)}\,S(a)S(a)^T\Big), \quad a, b \in A. \tag{8.93}$$

Note that 

$$R(b,b) = I(b). \tag{8.94}$$

Remark 273. For some $a, b \in A$, let $R(a,b)$ and $I(a)$ have real-valued entries. Then, $R(a,b)$ 
is positive definite only if for each $v \in \mathbb{R}^l$, $v \neq 0$, $\mathbb{E}_{\mathbb{Q}(a)}\big(\frac{L(b)}{L(a)}|S^T(a)v|^2\big) > 0$, which holds only if 
$\mathbb{Q}(a)(|S^T(a)v| \neq 0) > 0$, $v \in \mathbb{R}^l$, $v \neq 0$, and thus only if $I(a)$ is positive definite. 

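Let $b^*$ be an optimal-variance IS parameter. Then, from (8.88) 

$$\begin{aligned}
\Sigma_{\mathrm{ce}}(b) &= \mathbb{E}_{\mathbb{Q}_1}\big(Z^2 L(b)\,S(b^*)S(b^*)^T\big) \\
&= \operatorname{msq}(b^*)\,\mathbb{E}_{\mathbb{Q}_1}\Big(\frac{L(b)}{L(b^*)^2}\,S(b^*)S(b^*)^T\Big) \\
&= \operatorname{msq}(b^*)\,R(b^*,b)
\end{aligned} \tag{8.95}$$

and 

$$\begin{aligned}
\Sigma_{\mathrm{msq}}(b) &= \mathbb{E}_{\mathbb{Q}_1}\big(Z^4 L(b) L(b^*)^2\,S(b^*)S(b^*)^T\big) \\
&= \operatorname{msq}(b^*)^2\,\mathbb{E}_{\mathbb{Q}_1}\Big(\frac{L(b)}{L(b^*)^2}\,S(b^*)S(b^*)^T\Big) \\
&= \operatorname{msq}(b^*)^2\,R(b^*,b).
\end{aligned} \tag{8.96}$$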

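For the cross-entropy, let us assume that $b^*$ is a zero-variance IS parameter, so that $\mathbb{Q}_1$ a.s. we 
have $ZL(b^*) = \alpha$. Then 

$$\operatorname{ce}(b) = -\mathbb{E}_{\mathbb{Q}_1}(Z\,l(b)) = -\alpha\,\mathbb{E}_{\mathbb{Q}_1}\big(L(b^*)^{-1}\,l(b)\big). \tag{8.97}$$

Thus, assuming that one can move the derivatives inside the expectation 

$$\nabla^2 \operatorname{ce}(b) = -\alpha\,\mathbb{E}_{\mathbb{Q}_1}\big(L(b^*)^{-1}\,\nabla^2 l(b)\big) = -\alpha\,\mathbb{E}_{\mathbb{Q}(b^*)}\big(\nabla^2 l(b)\big), \tag{8.98}$$

in which case from (8.91) 

$$H_{\mathrm{ce}} = \nabla^2 \operatorname{ce}(b^*) = \alpha\,I(b^*). \tag{8.99}$$

Assuming that $I(b^*)$ is positive definite and $\alpha \neq 0$, from $\operatorname{msq}(b^*) = \alpha^2$, (8.22), (8.95), and (8.99) 
we have 

$$V_{\mathrm{ce}}(b) = H_{\mathrm{ce}}^{-1}\,\Sigma_{\mathrm{ce}}(b)\,H_{\mathrm{ce}}^{-1} = I(b^*)^{-1} R(b^*,b)\,I(b^*)^{-1}, \tag{8.100}$$

which is positive definite from Remark 273. 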


Remark 274. Under the assumptions as above, from (8.94) and (8.100) we have $V_{\mathrm{ce}}(b^*) = I(b^*)^{-1}$. 
Note that this is the asymptotic covariance matrix of maximum likelihood estimators 
for $b^*$ being the true parameter, see e.g. page 63 in [55]. This should be the case, since under 
Condition 32 for $b' = b^*$, a.s. 

$$\widehat{\mathrm{ce}}_n(b^*,b)(\widetilde{\kappa}_n) = \overline{(ZL(b^*)\ln(L(b)))}_n(\widetilde{\kappa}_n) = -\alpha\,\overline{(\ln(p(b)))}_n(\widetilde{\kappa}_n), \tag{8.101}$$

and $b \to \overline{(\ln(p(b)))}_n(\widetilde{\kappa}_n) = \frac{1}{n}\sum_{i=1}^n \ln(p(b)(\kappa_i))$ is maximized in such maximum likelihood 
estimation (note that for $\alpha > 0$ we should minimize (8.101), while for $\alpha < 0$ we should maximize it). 


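Under the assumptions as above and for $\alpha > 0$, from (8.24), (8.95), and (8.99) 

$$B_{\mathrm{ce}}(b) = \tfrac{1}{2} H_{\mathrm{ce}}^{-1/2}\,\Sigma_{\mathrm{ce}}(b)\,H_{\mathrm{ce}}^{-1/2} = \tfrac{1}{2}\alpha\,I(b^*)^{-1/2} R(b^*,b)\,I(b^*)^{-1/2}, \tag{8.102}$$

which is positive definite. Note that $B_{\mathrm{ce}}(b^*) = \tfrac{1}{2}\alpha I_l$ and $\operatorname{Tr}(B_{\mathrm{ce}}(b^*)) = \tfrac{\alpha l}{2}$. 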

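For the mean square, let us assume that $b^*$ is an optimal-variance IS parameter. Then, from 
(8.88) 

$$\operatorname{msq}(b) = \mathbb{E}_{\mathbb{Q}_1}\big(Z^2 L(b)\big) = \operatorname{msq}(b^*)\,\mathbb{E}_{\mathbb{Q}_1}\Big(\frac{L(b)}{L(b^*)^2}\Big). \tag{8.103}$$

Thus, assuming that one can move the derivatives inside the expectation we have 

$$\begin{aligned}
\nabla^2 \operatorname{msq}(b) &= -\operatorname{msq}(b^*)\,\nabla\,\mathbb{E}_{\mathbb{Q}_1}\Big(\frac{L(b)}{L(b^*)^2}\,S(b)^T\Big) \\
&= \operatorname{msq}(b^*)\,\mathbb{E}_{\mathbb{Q}_1}\Big(\frac{L(b)}{L(b^*)^2}\big(S(b)S(b)^T - \nabla^2 l(b)\big)\Big).
\end{aligned} \tag{8.104}$$

In such a case, from (8.91), 

$$\begin{aligned}
H_{\mathrm{msq}} &= \operatorname{msq}(b^*)\,\mathbb{E}_{\mathbb{Q}(b^*)}\big(S(b^*)S(b^*)^T - \nabla^2 l(b^*)\big) \\
&= 2\operatorname{msq}(b^*)\,I(b^*).
\end{aligned} \tag{8.105}$$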


Remark 275. Under the appropriate assumptions as in Remark 29, when $b^*$ is a zero-variance 
IS parameter, then, for d denoting the cross-entropy distance, $b \to d(\mathbb{Q}(b^*),\mathbb{Q}(b))$ and ce are 
linearly equivalent with a linear proportionality constant $\alpha$ (see (4.8)). Thus, in such a case 
(8.99) follows from the well-known fact that under appropriate assumptions 

$$\big(\nabla_b^2\, d(\mathbb{Q}(b^*),\mathbb{Q}(b))\big)_{b=b^*} = I(b^*). \tag{8.106}$$

Furthermore, from Remark 30, when $b^*$ is an optimal-variance IS parameter, then, for d 
denoting the Pearson divergence, $b \to d(\mathbb{Q}(b^*),\mathbb{Q}(b))$ and msq are linearly equivalent with a 
linear proportionality constant $\operatorname{msq}(b^*)$ (see (4.11) and (3.12)). Thus, in such a case (8.105) is 
equivalent to the fact that 

$$\big(\nabla_b^2\, d(\mathbb{Q}(b^*),\mathbb{Q}(b))\big)_{b=b^*} = 2I(b^*). \tag{8.107}$$


Assuming that $I(b^*)$ is positive definite and $\operatorname{msq}(b^*) \neq 0$, from (8.22), (8.96), and (8.105), 

$$V_{\mathrm{msq}}(b) = H_{\mathrm{msq}}^{-1}\,\Sigma_{\mathrm{msq}}(b)\,H_{\mathrm{msq}}^{-1} = \tfrac{1}{4}\,I(b^*)^{-1} R(b^*,b)\,I(b^*)^{-1}, \tag{8.108}$$

which is positive definite from Remark 273. Thus, assuming further that $b^*$ is a zero-variance 
IS parameter and we have (8.100), it holds 

$$V_{\mathrm{msq}}(b) = \tfrac{1}{4}\,V_{\mathrm{ce}}(b). \tag{8.109}$$

From (8.108), we have $V_{\mathrm{msq}}(b^*) = \tfrac{1}{4} I(b^*)^{-1}$. Furthermore, from (8.24) 

$$B_{\mathrm{msq}}(b) = \tfrac{1}{2} H_{\mathrm{msq}}^{-1/2}\,\Sigma_{\mathrm{msq}}(b)\,H_{\mathrm{msq}}^{-1/2} = \frac{\operatorname{msq}(b^*)}{4}\,I(b^*)^{-1/2} R(b^*,b)\,I(b^*)^{-1/2}, \tag{8.110}$$

which is positive definite. Note that $B_{\mathrm{msq}}(b^*) = \frac{\operatorname{msq}(b^*)}{4} I_l$ and $\operatorname{Tr}(B_{\mathrm{msq}}(b^*)) = \frac{l \operatorname{msq}(b^*)}{4}$. 

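Remark 276. Let A, T, $d_t$, and $r_t$ be as in Section 8.1, let $b \in A$, and $u : A \to \mathbb{R}$ be such that 
$u(b) \in \mathbb{R}_+$ and $\sqrt{r_t}\,(d_t - b^*) \Rightarrow \mathcal{N}(0, u(b)V_{\mathrm{ce}}(b))$ (see Section 8.5 for sufficient conditions for this to 
hold for $b = b'$ and $r_t = t$ for the SSM of the cross-entropy estimators and Section 8.7 for $b = d^*$, 
$u(d^*) = 1$, and $r_k = n_k$ for the MSM of such estimators). Let further msq be twice differentiable 
in $b^*$ with $\nabla \operatorname{msq}(b^*) = 0$ and $H_{\mathrm{msq}} = \nabla^2 \operatorname{msq}(b^*)$. Then, from Remark 251, for 

$$B_{\mathrm{ce,msq}}(b) := \tfrac{1}{2} H_{\mathrm{msq}}^{1/2}\,V_{\mathrm{ce}}(b)\,H_{\mathrm{msq}}^{1/2}, \tag{8.111}$$

we have 

$$r_t\big(\operatorname{msq}(d_t) - \operatorname{msq}(b^*)\big) \Rightarrow \widetilde{\chi^2}\big(u(b)B_{\mathrm{ce,msq}}(b)\big). \tag{8.112}$$

For $b^*$ being a zero-variance IS parameter, $I(b^*)$ being positive definite, and $\alpha \neq 0$, from (8.111), 
(8.105), (8.100), and (8.110) 

$$B_{\mathrm{ce,msq}}(b) = \operatorname{msq}(b^*)\,I(b^*)^{-1/2} R(b^*,b)\,I(b^*)^{-1/2} = 4B_{\mathrm{msq}}(b). \tag{8.113}$$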

Remark 277. Consider the LETGS setting and let Condition 115 hold for S = 1. Then, from 
$\mathbb{E}_{\mathbb{Q}(a)}\big(\big|\frac{L(b)}{L(a)} S_i(a)S_j(a)\big|\big) = \mathbb{E}_{\mathbb{Q}_1}\big(\big|\frac{L(b)}{L^2(a)} S_i(a)S_j(a)\big|\big)$, Theorem 121, and Remark 119, we have $R(a,b) \in \mathbb{R}^{l \times l}$, 
$a, b \in A$, and thus also $I(b) \in \mathbb{R}^{l \times l}$, $b \in A$. Furthermore, from Theorem 122, the above derivation 
leading to (8.91) can be carried out. From these theorems and remark we can also move the 
derivatives inside the expectation in (8.104), and, using reasoning analogous to that in the proof of 
Theorem 122, also in (8.98). 


Consider now the ECM setting as in Section 5.1, assuming Condition 36. Then, from (8.90) 

$$I(b) = \mathbb{E}_{\mathbb{Q}(b)}\big((X - \mu(b))(X - \mu(b))^T\big) = \Sigma(b). \tag{8.114}$$

Furthermore, 

$$\begin{aligned}
R(a,b) &= \mathbb{E}_{\mathbb{Q}_1}\Big(\frac{L(b)}{L(a)^2}\,S(a)S(a)^T\Big) \\
&= \exp(\Psi(b) - 2\Psi(a))\,\mathbb{E}_{\mathbb{Q}_1}\big(\exp((2a-b)^T X)(X - \mu(a))(X - \mu(a))^T\big) \\
&= \exp(\Psi(b) + \Psi(2a-b) - 2\Psi(a))\,\mathbb{E}_{\mathbb{Q}(2a-b)}\big((X - \mu(a))(X - \mu(a))^T\big) \\
&= \exp(\Psi(b) + \Psi(2a-b) - 2\Psi(a))\big(\Sigma(2a-b) + (\mu(2a-b) - \mu(a))(\mu(2a-b) - \mu(a))^T\big).
\end{aligned} \tag{8.115}$$

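For a positive definite matrix $B \in \mathrm{Sym}_l(\mathbb{R})$ and $b \in \mathbb{R}^l$ let $|b|_B = \sqrt{b^T B b}$. Then, $|\cdot|_B$ is a norm 
on $\mathbb{R}^l$. For X having under $\mathbb{Q}_1$ a normal distribution with some positive definite covariance matrix M, we have 
from (8.115) and the discussion in Section 5.1 

$$R(a,b) = \exp(|a-b|_M^2)\big(M + M(a-b)(a-b)^T M\big). \tag{8.116}$$

In particular, for $b^*$ being a zero-variance IS parameter, from (8.100) and (8.109) we have 

$$V_{\mathrm{ce}}(b) = \exp(|b^*-b|_M^2)\big(M^{-1} + (b^*-b)(b^*-b)^T\big) = 4V_{\mathrm{msq}}(b), \tag{8.117}$$

and, from (8.110), 

$$B_{\mathrm{msq}}(b) = \frac{\operatorname{msq}(b^*)}{4}\exp(|b^*-b|_M^2)\big(I_l + M^{1/2}(b^*-b)(b^*-b)^T M^{1/2}\big). \tag{8.118}$$

Thus, the mean of $\widetilde{\chi^2}(B_{\mathrm{msq}}(b))$ is 

$$\operatorname{Tr}(B_{\mathrm{msq}}(b)) = \frac{\operatorname{msq}(b^*)}{4}\exp(|b^*-b|_M^2)\big(l + |b^*-b|_M^2\big). \tag{8.119}$$

Note that $b^*$ is the unique minimum point of $b \to \operatorname{Tr}(B_{\mathrm{msq}}(b))$. From the remark below it 
follows that for X having a different distribution under $\mathbb{Q}_1$ this may not be the case. 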
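
As a sanity check on (8.116), the following minimal Python sketch compares the closed form with a direct numerical evaluation of $R(a,b) = \mathbb{E}_{\mathbb{Q}(a)}\big(\frac{L(b)}{L(a)}S(a)^2\big)$ in dimension l = 1. It assumes the standard exponential tilting of a normal distribution, i.e. $\Psi(b) = \mu_0 b + \frac{1}{2} m b^2$ and $X \sim \mathcal{N}(\mu_0 + m a, m)$ under $\mathbb{Q}(a)$; the mean `mu0`, the function names, and the grid parameters are illustrative assumptions and not part of the original text.

```python
import numpy as np

def r_closed_form(a, b, m):
    """R(a, b) as in (8.116) for l = 1 and variance m > 0."""
    return np.exp(m * (a - b) ** 2) * (m + (m * (a - b)) ** 2)

def r_by_quadrature(a, b, m, mu0=0.0, n_grid=200001, half_width=12.0):
    """E_{Q(a)}[(L(b)/L(a)) S(a)^2] by the trapezoidal rule (illustrative check).

    Assumes Psi(t) = mu0*t + m*t^2/2, so that X ~ N(mu0 + m*a, m) under Q(a).
    """
    psi = lambda t: mu0 * t + 0.5 * m * t * t
    mean_a = mu0 + m * a                      # mu(a)
    sd = np.sqrt(m)
    x = np.linspace(mean_a - half_width * sd, mean_a + half_width * sd, n_grid)
    density_a = np.exp(-0.5 * (x - mean_a) ** 2 / m) / np.sqrt(2.0 * np.pi * m)
    ratio = np.exp((a - b) * x + psi(b) - psi(a))   # L(b) / L(a)
    y = density_a * ratio * (x - mean_a) ** 2       # S(a) = X - mu(a)
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

if __name__ == "__main__":
    a, b, m = 0.3, -0.4, 1.5
    print(r_closed_form(a, b, m), r_by_quadrature(a, b, m, mu0=0.7))
```

The two printed values should agree up to quadrature error, and the result does not depend on `mu0`, in line with (8.116) involving only M.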

Remark 278. Consider the ECM setting for $A = \mathbb{R}$. Then, $R(a,b) = \mathbb{E}_{\mathbb{Q}(a)}\big(\frac{L(b)}{L(a)}(S(a))^2\big)$ and 

$$\big(\nabla_b R(a,b)\big)_{b=a} = -\mathbb{E}_{\mathbb{Q}(a)}\big((S(a))^3\big) = -\mathbb{E}_{\mathbb{Q}(a)}\big((X - \mu(a))^3\big). \tag{8.120}$$

From the convexity of $b \to R(a,b)$ (which follows from the convexity of $b \to L(b)$), a necessary 
and sufficient condition for $b = a$ to be its minimum point (and thus for a zero-variance IS 
parameter $b^* = a$ to be the minimum point of $b \to B_{\mathrm{msq}}(b)$ as in (8.110)), is that X has a zero 
third central moment under $\mathbb{Q}(a)$. This does not hold e.g. for $X \sim \mathrm{Pois}(\mu(a))$ under $\mathbb{Q}(a)$ as in 
Section 5.1, for which $\mathbb{E}_{\mathbb{Q}(a)}((X - \mu(a))^3) = \mu(a) > 0$. 
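
The Poisson case can be illustrated numerically. The sketch below evaluates $R(a,b)$ through the last line of (8.115) under the assumption that $X \sim \mathrm{Pois}(\lambda_0)$ under $\mathbb{Q}_1$, so that $\Psi(b) = \lambda_0(e^b - 1)$ and $\mu(b) = \Sigma(b) = \lambda_0 e^b$; the parameter `lam0` and the function names are hypothetical. It then approximates the slope in (8.120) by a central difference.

```python
import math

def r_poisson(a, b, lam0=1.0):
    """R(a, b) from the last line of (8.115), assuming X ~ Pois(lam0) under Q_1."""
    psi = lambda t: lam0 * (math.exp(t) - 1.0)   # cumulant generating function
    mu = lambda t: lam0 * math.exp(t)            # mean, equal to the variance here
    c = 2.0 * a - b
    prefactor = math.exp(psi(b) + psi(c) - 2.0 * psi(a))
    return prefactor * (mu(c) + (mu(c) - mu(a)) ** 2)

def slope_at_diagonal(a, lam0=1.0, h=1e-5):
    """Central-difference approximation of d/db R(a, b) at b = a."""
    return (r_poisson(a, a + h, lam0) - r_poisson(a, a - h, lam0)) / (2.0 * h)

if __name__ == "__main__":
    a = 0.2
    print(slope_at_diagonal(a), -math.exp(a))    # both close to -mu(a)
    print(r_poisson(a, a + 0.01) < r_poisson(a, a))
```

Since the slope at b = a equals $-\mu(a) < 0$, moving b slightly above a decreases $R(a,b)$, so b = a is not the minimum point in this case.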

In the LETGS setting, assuming Condition 53, that for some $b \in A$, G has $\mathbb{Q}(b)$-integrable 
entries, and that we have (8.91), from (5.43) 

$$I(b) = 2\,\mathbb{E}_{\mathbb{Q}(b)}(G), \tag{8.121}$$

which from Lemma 79 is positive definite only if Condition 76 holds for Z = 1. 


8.10 An analytical example of a symmetric three-point distribution 


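Let us consider ECM as in Section 5.1 for l = 1, assuming that $\mathbb{Q}_1(X = -1) = \mathbb{Q}_1(X = 0) = 
\mathbb{Q}_1(X = 1) = \frac{1}{3}$. Then, we have $A = \mathbb{R}$, $\Phi(b) = \frac{1}{3}(1 + e^b + e^{-b})$, $L(b) = \Phi(b)\exp(-bX)$, $\Psi(b) = \ln(\Phi(b))$, 
$\mu(b) = \nabla\Psi(b) = \frac{e^b - e^{-b}}{1 + e^b + e^{-b}}$, and $\Sigma(b) = \nabla^2\Psi(b) = \frac{e^b + e^{-b} + 4}{(1 + e^b + e^{-b})^2}$. For some $d \in \mathbb{R}$, let $Z = \mathbb{1}(X = 0) + d\,\mathbb{1}(X \neq 0)$. 
We have $\alpha = \frac{1+2d}{3}$ and 

$$\operatorname{msq}(b) = \mathbb{E}_{\mathbb{Q}_1}\big(Z^2 L(b)\big) = \tfrac{1}{9}(1 + e^b + e^{-b})\big(1 + d^2(e^b + e^{-b})\big). \tag{8.122}$$

The unique minimizer of msq is 0, which corresponds to crude MC. It holds $\operatorname{msq}(0) = \frac{1+2d^2}{3}$ 
and $\operatorname{var}(0) = \frac{1+2d^2}{3} - \big(\frac{1+2d}{3}\big)^2 = \frac{2}{9}(1-d)^2$. Thus, there exists a zero-variance IS parameter only 
if d = 1, in which case such a parameter is $b^* := 0$. We have $\mathbb{E}_{\mathbb{Q}_1}(ZX) = 0$, so that from (6.9), 
$\operatorname{ce}(b) = \alpha\Psi(b)$. Thus, if $d \neq -\frac{1}{2}$ then ce has a unique optimum point $0 = b^*$, which is a 
minimum point if $\alpha > 0$ and a maximum point if $\alpha < 0$. We have $\nabla^2\Psi(0) = \frac{2}{3}$ and 

$$H_{\mathrm{ce}} = \nabla^2 \operatorname{ce}(0) = \alpha\nabla^2\Psi(0) = \frac{2(1+2d)}{9}. \tag{8.123}$$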

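It holds $\nabla L(b) = -X\exp(-bX)\Phi(b) + \exp(-bX)\frac{e^b - e^{-b}}{3}$, so that $\nabla L(0) = -X$. Thus, from (8.29), 
for 

$$g(b) := (1 + e^b + e^{-b})(e^b + e^{-b}), \tag{8.124}$$

we have 

$$\Sigma_{\mathrm{ce}}(b) = \mathbb{E}_{\mathbb{Q}_1}\big(Z^2 L(b) X^2\big) = \frac{d^2}{9}\,g(b), \tag{8.125}$$

and for $d \neq -\frac{1}{2}$ 

$$V_{\mathrm{ce}}(b) = H_{\mathrm{ce}}^{-2}\,\Sigma_{\mathrm{ce}}(b) = \Big(\frac{3d}{2(1+2d)}\Big)^2 g(b). \tag{8.126}$$

We have 

$$H_{\mathrm{msq}} = \nabla^2 \operatorname{msq}(0) = \tfrac{2}{9}(1 + 5d^2), \tag{8.127}$$

from (8.33) 

$$\Sigma_{\mathrm{msq}}(b) = \mathbb{E}_{\mathbb{Q}_1}\big(L(b) Z^4 X^2\big) = \frac{d^4}{9}\,g(b), \tag{8.128}$$

and thus 

$$V_{\mathrm{msq}}(b) = H_{\mathrm{msq}}^{-2}\,\Sigma_{\mathrm{msq}}(b) = \Big(\frac{3d^2}{2(1+5d^2)}\Big)^2 g(b). \tag{8.129}$$

From (8.41) 

$$\Sigma_{\mathrm{msq2}}(b) = \mathbb{E}_{\mathbb{Q}_1}\big((Z^2 - \operatorname{msq}(0))^2 L(b) X^2\big) = \tfrac{1}{81}(d^2-1)^2 g(b), \tag{8.130}$$

so that 

$$V_{\mathrm{msq2}}(b) = H_{\mathrm{msq}}^{-2}\,\Sigma_{\mathrm{msq2}}(b) = \Big(\frac{d^2-1}{2(1+5d^2)}\Big)^2 g(b). \tag{8.131}$$

From (8.57) 

$$\Sigma_{\mathrm{ic}}(b) = \mathbb{E}_{\mathbb{Q}_1}\big((Z^2 - \operatorname{msq}(0) - \operatorname{var}(0))^2 L(b) X^2\big) = \tfrac{1}{729}\big((d-1)(d+5)\big)^2 g(b), \tag{8.132}$$

and thus 

$$V_{\mathrm{ic}}(b) = H_{\mathrm{msq}}^{-2}\,\Sigma_{\mathrm{ic}}(b) = \Big(\frac{(d-1)(d+5)}{6(1+5d^2)}\Big)^2 g(b). \tag{8.133}$$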

For s substituted by ce, msq, msq2, and ic, let us further write $V_s(b,d)$ rather than $V_s(b)$ 
to mark its dependence on d. Let for such different s, $f_s(d) = 2(1+5d^2)\sqrt{V_s(b,d)/g(b)}$, so that 
$f_{\mathrm{ce}}(d) = \big|\frac{3d(1+5d^2)}{1+2d}\big|$, $f_{\mathrm{msq}}(d) = 3d^2$, $f_{\mathrm{msq2}}(d) = |d^2-1|$, and $f_{\mathrm{ic}}(d) = \frac{1}{3}|(d-1)(d+5)|$. For different 
substitutions of $s_1$ and $s_2$ as for s above, it holds $\frac{V_{s_1}}{V_{s_2}}(b,d) = \big(\frac{f_{s_1}}{f_{s_2}}(d)\big)^2$ whenever $f_{s_2}(d) \neq 0$. The 
graphs of functions $f_s$ for different s are shown in Figure 8.1. Note that $b \to g(b)$ is positive and 
has a unique minimum point in $0 = b^*$. Thus, if $f_s(d) \neq 0$, then $b \to V_s(b,d) = \big(\frac{f_s(d)}{2(1+5d^2)}\big)^2 g(b)$ 
also has a unique minimum point in $b^*$. It holds $f_{\mathrm{ce}}(0) = f_{\mathrm{msq}}(0) = 0$. For $d \in \mathbb{R} \setminus \{-\frac{1}{2}, 0\}$, let 
$r(d) = \frac{f_{\mathrm{ce}}}{f_{\mathrm{msq}}}(d) = \big|\frac{1+5d^2}{d(1+2d)}\big|$. One can easily show that this function assumes a unique minimum 
in d = 1, in which r(d) = 2. In particular, for $d \in \mathbb{R} \setminus \{-\frac{1}{2}\}$ we have $f_{\mathrm{ce}}(d) \geq 2f_{\mathrm{msq}}(d)$ and thus 
$V_{\mathrm{ce}}(b,d) \geq 4V_{\mathrm{msq}}(b,d)$, $b \in \mathbb{R}$, with equalities holding only for d equal to 0 and 1, the latter being 
in agreement with the theory in Section 8.9 since for d = 1, $b^*$ is a zero-variance IS parameter. 
From an easy calculation we receive that $f_{s_1} < f_{s_2}$ on D, $f_{s_1} = f_{s_2}$ on $\partial D$, and $f_{s_1} > f_{s_2}$ otherwise 
for $s_1$ = msq, $s_2$ = msq2, and $D = (-\frac{1}{2}, \frac{1}{2})$, for $s_1$ = msq, $s_2$ = ic, and $D = \big(\frac{-2-3\sqrt{6}}{10}, \frac{-2+3\sqrt{6}}{10}\big)$, as well 
as for $s_1$ = msq2, $s_2$ = ic, and $D = (-2, 1)$. Since for s equal to msq2 or ic, $f_s$ is continuous and 
$f_s(0) > 0$, such $f_s$ are higher than $f_{\mathrm{ce}}$ (and thus also than $f_{\mathrm{msq}}$) on some neighbourhood of 0. 
Note that in agreement with the theory in Section 8.9, for d = 1, for which $b^*$ is a zero-variance 
IS parameter, for s = ic, $V_s(b,d) = 0$, $b \in \mathbb{R}$, and for s = msq2, this holds also for d = -1, for 
which $b^*$ is an optimal-variance IS parameter. For $B_s(b,d) = \frac{1}{2}V_s(b,d)H_{\mathrm{msq}}$ for s equal to msq, 
msq2, or ic, and for $B_{\mathrm{ce,msq}}(b,d) = \frac{1}{2}V_{\mathrm{ce}}(b,d)H_{\mathrm{msq}}$ for s = ce, we have relations analogous to those 
for the different $V_s(b,d)$ above. 
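
The relations of this section can be checked by direct enumeration over the three support points. The following Python sketch is only an illustration: the function names are hypothetical, the defining expectations are those quoted above from (8.29), (8.33), (8.41), and (8.57) with $b^* = 0$, and $H_{\mathrm{msq}}$ is used for msq, msq2, and ic as in the text. It is not valid for d = -1/2, where $H_{\mathrm{ce}} = 0$.

```python
import math

SUPPORT = (-1, 0, 1)  # X takes each of these values with probability 1/3 under Q_1

def expect(fn):
    """Expectation under Q_1 of fn(X)."""
    return sum(fn(x) for x in SUPPORT) / 3.0

def lik_ratio(b, x):
    """L(b) = Phi(b) * exp(-b x) with Phi(b) = (1 + e^b + e^-b) / 3."""
    return (1.0 + math.exp(b) + math.exp(-b)) / 3.0 * math.exp(-b * x)

def g(b):
    """g(b) as in (8.124)."""
    return (1.0 + math.exp(b) + math.exp(-b)) * (math.exp(b) + math.exp(-b))

def v_all(b, d):
    """V_s(b, d) for s in {ce, msq, msq2, ic}; requires d != -1/2."""
    z = lambda x: 1.0 if x == 0 else d
    alpha = expect(z)
    msq0 = expect(lambda x: z(x) ** 2)
    var0 = msq0 - alpha ** 2
    h_ce = alpha * 2.0 / 3.0                  # alpha * Psi''(0), see (8.123)
    h_msq = 2.0 * (1.0 + 5.0 * d * d) / 9.0   # see (8.127)
    w = lambda x: lik_ratio(b, x) * x * x
    sigma = {
        "ce": expect(lambda x: z(x) ** 2 * w(x)),
        "msq": expect(lambda x: z(x) ** 4 * w(x)),
        "msq2": expect(lambda x: (z(x) ** 2 - msq0) ** 2 * w(x)),
        "ic": expect(lambda x: (z(x) ** 2 - msq0 - var0) ** 2 * w(x)),
    }
    h = {"ce": h_ce, "msq": h_msq, "msq2": h_msq, "ic": h_msq}
    return {s: sigma[s] / h[s] ** 2 for s in sigma}

if __name__ == "__main__":
    for d in (1.0, -1.0, 0.5):
        print(d, v_all(0.3, d))
```

For d = 1 the output satisfies $V_{\mathrm{ce}} = 4V_{\mathrm{msq}}$ and $V_{\mathrm{msq2}} = V_{\mathrm{ic}} = 0$, for d = -1 only $V_{\mathrm{msq2}}$ vanishes, and for d = 0.5 the inequality $V_{\mathrm{ce}} > 4V_{\mathrm{msq}}$ is strict.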
Figure 8.1: Functions $f_s$ as in the main text for s equal to ce, msq, msq2, and ic. 

Two-stage estimation 


In this chapter we briefly discuss a two-stage adaptive method for estimation, in the first 
stage of which some parameter vector d is computed, with the help of which in the second 
stage an estimator of the quantity of interest $\alpha \in \mathbb{R}$ is calculated. For a nonempty parameter 
set $A \in \mathcal{B}(\mathbb{R}^l)$ and a family of distributions as in Section 3.4, let a measurable function 
$h : \mathcal{S}(A) \otimes \mathcal{S}_1 \to \mathcal{S}(\mathbb{R})$ be such that for each $b \in A$, 

$$\mathbb{E}_{\mathbb{Q}(b)}(h(b,\cdot)) = \alpha. \tag{9.1}$$

Let $\operatorname{var}(b) = \operatorname{Var}_{\mathbb{Q}(b)}(h(b,\cdot))$, $b \in A$. Consider some cost variable C and functions c and ic as 
in Section 4.1, but in the definition of ic using the above var (in particular, for ic we assume 
Condition 31). In the IS case conditions 17 and 214 hold, but the above setting is more general 
and can also describe e.g. control variates or control variates in conjunction with IS. The 
parameter vector d computed in the first stage is an A-valued random variable, which can be 
obtained e.g. using some adaptive algorithm like a single- or multi-stage minimization method 
of some estimators described in the previous sections. This can be done in many different 
ways as discussed in the remark below. 

Remark 279. One possibility is to use as d some parameter $d_t$ corresponding to some first-stage 
budget t as in Remark 213. Alternatively, in methods in which to compute some parameter 
$p_k$ one first needs to compute $p_l$, $l = 1,\ldots,k-1$, like in the MSM methods from the previous 
sections for $p_k$ as in Remark 213, one can set $d = p_\tau \mathbb{1}(\tau \neq \infty) + p_\infty \mathbb{1}(\tau = \infty)$ for some $p_\infty \in A$ and 
some a.s. finite $\mathbb{N}_+ \cup \{\infty\}$-valued stopping time $\tau$ for the filtration $\mathcal{F}_n = \sigma\{p_l : l \leq n\}$, $n \in \mathbb{N}_+$. 
For instance, if a.s. $p_k \to b^* \in A$, then $\tau$ can be the moment when the change of $p_k$ from $p_{k-1}$, or 
such a relative change if $b^* \neq 0$, becomes smaller than some $\epsilon \in \mathbb{R}_+$ (see e.g. Remark 10 in [43]). 
For the various $p_k$ from MSM methods as in Remark 213, the fact that a.s. $p_k \to b^*$ follows 
from Condition 167 or its counterparts. When for some (random) real-valued functions $f_k$ on A, 
$k \in \mathbb{N}_+$ (which can be e.g. the minimized estimators), and $f : A \to \mathbb{R}$ we have a.s. $f_k(p_k) \to f(b^*)$, 
then such a stopping time can be based on the behaviour of $f_k(p_k)$, similarly as for $p_k$ above. 
Assuming that a.s. $f_k \overset{\mathrm{loc}}{\rightrightarrows} f$ (see Section 7.3 for sufficient assumptions), if a.s. $p_k \to b^*$, then from 
Lemma 191 a.s. $f_k(p_k) \to f(b^*)$. 
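
The stopping rule based on the change of $p_k$ from $p_{k-1}$ can be sketched as follows. This is a minimal illustration, not the procedure used in the experiments: the function names `stop_index` and `select_parameter`, the Euclidean norm, and the handling of a zero denominator in the relative change are assumptions made here. The decision at stage k uses only $p_1,\ldots,p_k$, as required of a stopping time for the filtration generated by the parameters.

```python
import numpy as np

def stop_index(params, eps, relative=False):
    """Return the 1-based index k >= 2 of the first parameter p_k whose change
    from p_{k-1} is below eps, or None if no such k exists (tau = infinity)."""
    for k in range(1, len(params)):
        prev = np.asarray(params[k - 1], dtype=float)
        cur = np.asarray(params[k], dtype=float)
        change = np.linalg.norm(cur - prev)
        if relative:
            scale = np.linalg.norm(cur)
            if scale == 0.0:
                continue          # relative change undefined, keep iterating
            change /= scale
        if change < eps:
            return k + 1
    return None

def select_parameter(params, eps, fallback, relative=False):
    """d = p_tau if tau is finite, and the fallback value p_infinity otherwise."""
    tau = stop_index(params, eps, relative)
    return fallback if tau is None else params[tau - 1]

if __name__ == "__main__":
    stages = [[0.0], [1.0], [1.5], [1.75], [1.875]]
    print(stop_index(stages, 0.3), select_parameter(stages, 0.3, [0.0]))
```

In the usage example the absolute changes are 1, 0.5, 0.25, and 0.125, so with eps = 0.3 the rule stops at the fourth parameter.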

One way to model the second stage of a two-stage estimation method is to consider it on 
a different probability space than the first one, for a fixed computed value v of the variable 
d from the first stage. In such a case, for some $\kappa_1, \kappa_2, \ldots$, i.i.d. $\sim \mathbb{Q}(v)$, in the second stage 
we perform an MC procedure as in Chapter 2 but using $Z_i = h(v,\kappa_i)$, $i \in \mathbb{N}_+$, e.g. using a 
fixed number of samples or a fixed approximate budget. We can also construct asymptotic 
confidence intervals for $\alpha$ as in that chapter. From the discussion in Chapter 2, the inefficiency 
of such a procedure can be quantified using the inefficiency constant ic(v). This justifies 
comparing the asymptotic efficiency of methods for finding the adaptive parameters in the 
first stage of a two-stage method as above by comparing their first- and, if applicable, second- 
order asymptotic efficiency for the minimization of ic as discussed in sections 7.10 and 8.3. 
An alternative way to model the second stage of a two-stage method is to consider it 
on the same probability space as the first one. Let us assume the following condition. 

Condition 280. Random variables $\phi_i$, $i \in \mathbb{N}_+$, are conditionally independent given d and have 
the same conditional distribution $\mathbb{Q}(b)$ given d = b. 

Condition 280 is implied by the following one. 

Condition 281. Condition 19 holds, and for $\beta_1, \beta_2, \ldots$, i.i.d. $\sim \mathbb{P}_1$ and independent of d we have 
$(\phi_i)_{i \in \mathbb{N}_+} = (\xi(d,\beta_i))_{i \in \mathbb{N}_+}$. 

In the second stage of the considered method one computes an estimator 

$$\widehat{\alpha}_n = \frac{1}{n}\sum_{i=1}^{n} h(d,\phi_i). \tag{9.2}$$

Similarly as above, the number n of samples can be deterministic or random. In the first case 
the resulting estimator is unbiased, while in the second this need not be true. Random n 
can correspond e.g. to a fixed approximate computational budget and be given by definition 
(2.5) or (2.6) but for $C_i$ replaced by $C(\phi_i)$, $i \in \mathbb{N}_+$. 
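
A toy instance of the two-stage method with a deterministic number of second-stage samples can be sketched as below. Everything specific in it is an assumption chosen for illustration and does not come from the experiments of this work: the target is $\alpha = \mathbb{P}(X > 2)$ for a standard normal X, the IS family consists of mean shifts $\mathcal{N}(d, 1)$ with $L(d) = \exp(-dX + d^2/2)$, and the first stage minimizes the cross-entropy estimator from a crude MC pilot sample, which for this family reduces to a ratio of sample means.

```python
import numpy as np

def first_stage_ce(rng, n_pilot, level):
    """First stage: cross-entropy parameter from a crude MC pilot sample.
    For the normal mean-shift family the minimizer is sum(Z X) / sum(Z)."""
    x = rng.standard_normal(n_pilot)
    z = (x > level).astype(float)
    total = float(np.sum(z))
    return float(np.sum(z * x) / total) if total > 0.0 else 0.0  # fall back to CMC

def second_stage(rng, d, n, level):
    """Second stage: the estimator (9.2) with h(d, y) = Z(y) L(d)(y), y ~ N(d, 1),
    using fresh samples that are independent of the first stage."""
    y = rng.standard_normal(n) + d
    h = (y > level) * np.exp(-d * y + 0.5 * d * d)
    return float(np.mean(h))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = first_stage_ce(rng, 2000, 2.0)
    print(d, second_stage(rng, d, 20000, 2.0))  # second value estimates P(X > 2)
```

Because the second-stage samples are generated after d is fixed and independently of the pilot sample, this sketch corresponds to the setting of Condition 281 rather than to a sample-reusing variant.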

Remark 282. A possible alternative to the two-stage estimation method discussed above is the 
same as its second model above except that for the computation of $\widehat{\alpha}_n$ in the second stage one 
uses the variables $\phi_i = \xi(d,\beta_i)$ as in Condition 281 but without assuming that $\beta_i$, $i \in \mathbb{N}_+$, are 
independent of d. In such a case, Condition 280 may not hold. For example, one could reuse 
the i.i.d. random variables with distribution $\mathbb{P}_1$ generated for the estimation of d in the first 
stage as some (potentially all) of the variables $\beta_i$ used for the computation of $\widehat{\alpha}_n$, which could 
save computation time. Under appropriate identifications, such an approach using exactly 
the same $\beta_i$, $i = 1,\ldots,n$, in ESSM to compute d and then in (9.2) is used in the multiple control 
variates method (see [5, 22]), while for IS it was considered in [30]. In such a reusing approach, 
$\widehat{\alpha}_n$ as in (9.2) need not be unbiased even for n deterministic. Furthermore, one needs to store 
a potentially large random number of the generated values of random variables, which may be 
more difficult to implement and requires additional computer memory. Finally, in a number 
of situations, like in the case of our numerical experiments, generating the required parts of 
the variables $\beta_i$ forms only a small fraction of the computation time needed for computing the 
variables $h(d,\phi_i) = (ZL(d))(\xi(d,\beta_i))$ in the second stage, so that reusing some from the first 
stage would not lead to considerable time savings. 

Numerical experiments 


Our numerical experiments were carried out using programs written in MATLAB R2012a and 
run on a laptop. Unless stated otherwise, we used the simulation parameters, variables, and 
IS basis functions as for the problems of estimation of the expectations mgf($x_0$), $q_{1,a}(x_0)$, 
$q_{2,a}(x_0)$, and $p_{T,a}(x_0)$ in Section 5.9. In some of our experiments we performed the single- 
or multi-stage minimization of estimators $\widehat{\mathrm{est}}$ (where for short we write $\widehat{\mathrm{est}}$ rather than $\widehat{\mathrm{est}}_n$, 
$n \in \mathbb{N}_p$, for appropriate p) equal to ce, msq, msq2, and ic, as discussed in Section 7.1. In the 
MSM we used in each case $b_0 = 0$. For the minimization of msq2 and ic in these methods 
we used the MATLAB fminunc unconstrained minimization function with the default settings 
and exact gradients, for msq additionally using their exact Hessians, as discussed in Section 
7.1. The minimum points of ce were found by solving the linear systems of equations as in 
that section. Both for the crude MC (CMC) and when using IS, the computation times of the 
MC replicates in our experiments were typically approximately proportional to the replicates 
of the exit times t for the MGF and translated committors, and to the replicates of t' as in 
(5.73) for $p_{T,a}(x_0)$. Thus, we consider the theoretical cost variables C equal to Ht for the MGF 
and translated committors and to Ht' for $p_{T,a}(x_0)$. The proportionality constants of the 
replicates of such C to the simulation times (as in Chapter 2) were different for CMC and when 
using different basis functions in IS. 

The remainder of this chapter is organized as follows. In Section 10.1 we discuss some methods 
for testing statistical hypotheses, which are later used for interpreting the results of our 
numerical experiments. In Section 10.2 we describe two-stage estimation experiments as 
in Section 7.1, performing MSM in the first stages and in the second stages estimating the 
expectations of the functionals of the Euler scheme as above. In the second stages we also 
estimated some other quantities, like inefficiency constants, variances, mean costs, and the 
proportionality constants as above. We use these quantities to compare the efficiency of 
applying in an IS MC method the IS parameters obtained from the MSM of different estimators, 
as well as of using different added constants a and IS basis functions in such adaptive IS 
procedures. In Section 10.3 we compare the spread of the IS drifts coming from the SSM 
of different estimators and using different parameters b'. In Section 10.4 we provide some 
intuitions behind the results of our numerical experiments. 

10.1 Testing statistical hypotheses 

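Let $\mu_X, \mu_Y \in \mathbb{R}$, $\sigma_X, \sigma_Y \in \mathbb{R}_+$, and $\mathbb{R}$-valued random variables $\overline{X}_n$ and $\overline{Y}_n$, $n \in \mathbb{N}_+$, be such that 
$\widetilde{X}_n := \sqrt{n}(\overline{X}_n - \mu_X) \Rightarrow \mathcal{N}(0,\sigma_X^2)$, $\widetilde{Y}_n := \sqrt{n}(\overline{Y}_n - \mu_Y) \Rightarrow \mathcal{N}(0,\sigma_Y^2)$, and for each $j, k \in \mathbb{N}_+$, $\overline{X}_j$ is 
independent of $\overline{Y}_k$. Let $\widehat{\sigma}_{X,n}$ and $\widehat{\sigma}_{Y,n}$, $n \in \mathbb{N}_+$, be $[0,\infty)$-valued random variables such that 
$\widehat{\sigma}_{X,n} \overset{p}{\to} \sigma_X$ and $\widehat{\sigma}_{Y,n} \overset{p}{\to} \sigma_Y$. Let $a_n, b_n \in \mathbb{N}_+$, $n \in \mathbb{N}_+$, be such that $\lim_{n\to\infty} a_n = \lim_{n\to\infty} b_n = \infty$ 
and 

$$\lim_{n\to\infty} \frac{a_n}{b_n} = \rho \in \mathbb{R}_+. \tag{10.1}$$

Let 

$$t_n = \frac{\overline{X}_{a_n} - \overline{Y}_{b_n}}{\sqrt{\frac{\widehat{\sigma}_{X,a_n}^2}{a_n} + \frac{\widehat{\sigma}_{Y,b_n}^2}{b_n}}} = \frac{\sqrt{a_n}\,(\overline{X}_{a_n} - \overline{Y}_{b_n})}{\sqrt{\widehat{\sigma}_{X,a_n}^2 + \frac{a_n}{b_n}\widehat{\sigma}_{Y,b_n}^2}} \quad\text{and}\quad H_n = \frac{\sqrt{a_n}\,(\mu_Y - \mu_X)}{\sqrt{\widehat{\sigma}_{X,a_n}^2 + \frac{a_n}{b_n}\widehat{\sigma}_{Y,b_n}^2}}. \tag{10.2}$$

Lemma 283. Under the assumptions as above, we have 

$$t_n + H_n = \frac{\widetilde{X}_{a_n} - \sqrt{\frac{a_n}{b_n}}\,\widetilde{Y}_{b_n}}{\sqrt{\widehat{\sigma}_{X,a_n}^2 + \frac{a_n}{b_n}\widehat{\sigma}_{Y,b_n}^2}} \Rightarrow \mathcal{N}(0,1). \tag{10.3}$$

Proof. From the asymptotic properties of $\widetilde{X}_i$ and $\widetilde{Y}_j$ as above and their independence, we 
receive, e.g. using Fubini's theorem and the fact that convergence in distribution is equivalent 
to the pointwise convergence of characteristic functions, that 

$$\widetilde{X}_{a_n} - \sqrt{\rho}\,\widetilde{Y}_{b_n} \Rightarrow \mathcal{N}(0, \sigma_X^2 + \rho\sigma_Y^2). \tag{10.4}$$

Thus, from $G_n := \widetilde{Y}_{b_n}\big(\sqrt{\rho} - \sqrt{\tfrac{a_n}{b_n}}\big) \overset{p}{\to} 0$, 

$$\widetilde{X}_{a_n} - \sqrt{\tfrac{a_n}{b_n}}\,\widetilde{Y}_{b_n} = \widetilde{X}_{a_n} - \sqrt{\rho}\,\widetilde{Y}_{b_n} + G_n \Rightarrow \mathcal{N}(0, \sigma_X^2 + \rho\sigma_Y^2). \tag{10.5}$$

Furthermore, from the continuous mapping theorem, 

$$\sqrt{\widehat{\sigma}_{X,a_n}^2 + \tfrac{a_n}{b_n}\widehat{\sigma}_{Y,b_n}^2} \overset{p}{\to} \sqrt{\sigma_X^2 + \rho\sigma_Y^2}. \tag{10.6}$$

Now, (10.3) follows from (10.5), (10.6), and Slutsky's lemma. □ 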

If $\mu_X \leq \mu_Y$, then from (10.3) and $H_n \geq 0$, $n \in \mathbb{N}_+$, for each $\alpha \in (0,1)$ and $z_{1-\alpha}$ as in Remark 7, 

$$\limsup_{n\to\infty} \mathbb{P}(t_n > z_{1-\alpha}) \leq \lim_{n\to\infty} \mathbb{P}(t_n + H_n > z_{1-\alpha}) = \alpha, \tag{10.7}$$

i.e. the tests of the null hypothesis $\mu_X \leq \mu_Y$ with the regions of rejection $t_n > z_{1-\alpha}$, $n \in \mathbb{N}_+$, are 
pointwise asymptotically level $\alpha$ (see Definition 11.1.1 in [36]). We shall further use such tests 
for $z_{1-\alpha} = 3$, so that $\alpha$ = 0.00270. If for some selected n we have $t_n > 3$, i.e. the null hypothesis 
as above can be rejected, then we shall informally say that the estimate $\overline{X}_{a_n} \pm \frac{\widehat{\sigma}_{X,a_n}}{\sqrt{a_n}}$ of $\mu_X$ is 
(statistically significantly) higher than such an estimate $\overline{Y}_{b_n} \pm \frac{\widehat{\sigma}_{Y,b_n}}{\sqrt{b_n}}$ of $\mu_Y$. 

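Most frequently, for some i.i.d. square integrable random variables $X'_i$, $i \in \mathbb{N}_+$, and such 
variables $Y'_i$, $i \in \mathbb{N}_+$, independent of the $X'_i$, in such tests we shall use $\overline{X}_n = \frac{1}{n}\sum_{i=1}^n X'_i$ and $\widehat{\sigma}_{X,n} = 
\sqrt{\frac{1}{n-1}\sum_{i=1}^n (X'_i - \overline{X}_n)^2}$, and analogously for $\overline{Y}_n$ and $\widehat{\sigma}_{Y,n}$. In such a case $\frac{\widehat{\sigma}_{X,n}}{\sqrt{n}}$ shall be called an 
estimate of the standard deviation of the mean $\overline{X}_n$. 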
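
The test statistic $t_n$ from (10.2) and the rejection rule with threshold 3 can be sketched in a few lines of Python. The function names are hypothetical, and the sample standard deviation with the n - 1 normalization is an assumption of this sketch; any consistent estimator of $\sigma_X$ and $\sigma_Y$ fits the lemma above.

```python
import math

def mean_and_sd(sample):
    """Sample mean and sample standard deviation (n - 1 normalization assumed)."""
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    return mean, math.sqrt(var)

def t_statistic(xs, ys):
    """t_n from (10.2) with a_n = len(xs) and b_n = len(ys)."""
    mean_x, sd_x = mean_and_sd(xs)
    mean_y, sd_y = mean_and_sd(ys)
    return (mean_x - mean_y) / math.sqrt(sd_x ** 2 / len(xs) + sd_y ** 2 / len(ys))

def significantly_higher(xs, ys, threshold=3.0):
    """Reject the null hypothesis mu_X <= mu_Y when t_n exceeds the threshold."""
    return t_statistic(xs, ys) > threshold

if __name__ == "__main__":
    xs = [1.0, 2.0, 3.0, 4.0, 5.0]
    ys = [0.0, 1.0, 2.0, 3.0, 4.0]
    print(t_statistic(xs, ys), significantly_higher(xs, ys))
```

In the usage example both samples have sample variance 2.5 and the means differ by 1, so the statistic equals 1 and the null hypothesis is not rejected.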

10.2 Estimation experiments 

We first performed k-stage minimization methods of the different estimators and for the 
different estimation problems, using $n_i = 50 \cdot 2^{i-1}$ samples in the ith stage for $i = 1,\ldots,k$ 
for various $k \in \mathbb{N}_+$ (see Section 7.1). We chose k = 3 for the problem of estimating $q_{1,a}(x_0)$, 
k = 5 for mgf($x_0$), and k = 6 for $p_{T,a}(x_0)$ and $q_{2,a}(x_0)$. We first used a = 0.05 and M = 10 time- 
independent IS basis functions as in (5.81). For $i = 1,2,\ldots,6$, the IS drifts $r(b_i)$ corresponding to 
the minimization results $b_i$ from the ith stage of the MSM of different estimators for estimating 
the translated committor $q_{2,a}(x_0)$ are shown in Figure 10.1. The IS drifts corresponding to the 
final results of MSM for the estimation of all the expectations are shown in Figure 10.2. In 
figures 10.1 and 10.2 we also show for comparison approximations of the zero-variance IS drifts 
$r^*$ for the diffusion problems for the translated committors and MGF, computed from formula 
(5.77) using finite differences instead of derivatives and finite difference approximations of u 
in that formula computed as in Section 5.9. In Figure 10.1, the IS drifts from the consecutive 
stages of the MSM of msq2 and ic seem to converge the fastest to some limiting drift close 
to (the approximation of) $r^*$, those from the MSM of msq more slowly, and those from the MSM 
of ce the slowest. See Section 10.4 for some intuitions behind these results. 
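
The stage sample sizes $n_i = 50 \cdot 2^{i-1}$ and the resulting total first-stage sample counts follow directly from the description above. The helper below is a small illustration (the function name is hypothetical).

```python
def stage_sample_sizes(k, n1=50):
    """Sample sizes n_i = n1 * 2**(i - 1) for the stages i = 1, ..., k."""
    return [n1 * 2 ** (i - 1) for i in range(1, k + 1)]

if __name__ == "__main__":
    for k in (3, 5, 6):
        sizes = stage_sample_sizes(k)
        print(k, sizes, sum(sizes))
```

For k = 6 this gives 50, 100, 200, 400, 800, and 1600 samples, i.e. 3150 samples over all stages.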

Consider a numerical experiment in which, for a given IS parameter $v \in A$, we compute 
unbiased estimates of the mean cost c(v) as well as of the variance var(v) and the (theoretical) 
inefficiency constant ic(v) of the IS estimator of the expectation of the functional of the Euler 
scheme of interest, using estimators (4.28), (4.31), and (4.23) respectively for b = b' = v and 
n = 10. For some $\beta_1,\ldots,\beta_n$ i.i.d. $\sim$ U, these estimators are evaluated on $(\xi(v,\beta_i))_{i=1}^n$ (for $\xi$ as 
in (5.35)), so that the computations involve simulating n independent Euler schemes with 
an additional drift r(v) as in (5.57). Such experiments for the different estimation problems 
and for v equal to the final results of the MSM of ce, msq, msq2, and ic as above, and for 
v = 0 for CMC, were repeated independently K times in an outer MC loop for different K. For 
the problem of the estimation of $q_{2,a}(x_0)$ we additionally used as v the minimization result 
from the third step of MSM. For the MGF we made in all the cases K = 75000 repetitions. 
For the translated committors, when using CMC or IS with a parameter v from the 3-stage 
MSM of estimators other than ce, we took K = 2000, while in the other cases we chose K = 
5000. For $p_{T,a}$ we made K = 20000 repetitions both for the CMC and when using v from 
the MSM of ce, and K = 2- 10^ repetitions for v from the MSM of msq, msq2, and ic. The 
MC means of the inefficiency constant and variance estimators from the outer loops for 
the translated committors and MGF are given in Table 10.1, along with the estimates of 
the standard deviations of such means. For $p_{T,a}(x_0)$, such outer MC loop estimates of the 
inefficiency constants, variances, and mean costs are given in Table 10.2. 

Figure 10.1: IS drifts from different stages of MSM for estimating $q_{2,a}$, minimizing ce in (a), 
msq in (b), msq2 in (c), and ic in (d). 'Optimal' denotes an approximation of the zero-variance 
IS drift $r^*$. 

Remark 284. Note that due to Remark 64, the variables C as above have all moments (and thus 
also variance) finite under $\mathbb{Q}(v)$, $v \in A$. For v = 0 (i.e. for CMC), from the boundedness of the 
considered Z, we have the finiteness of the mean costs and of the variances and inefficiency 
constants of the estimators of the Euler scheme expectations of interest as well as of the variances 
of the utilized estimators of such quantities. For the general v and bounded stopping times (as 
is the case for such times equal to t' as in (5.73) when estimating $p_{T,a}$), the finiteness of the 
quantities as in the previous sentence follows from the corresponding Z being bounded, as well 
as from Theorem 121 and Remark 119. In cases when the stopping time is not bounded (like 
for such a time equal to t for the MGF and translated committors as above), one can ensure 
such boundedness by terminating the simulations at some fixed time as discussed in Section 
6.2. We did not terminate our simulations, but still our results can be interpreted as coming 
from simulations terminated at some time larger than any of the exit times encountered in our 
experiments. 

Figure 10.2: Final IS drifts from the MSM experiments minimizing different estimators, 
for $q_{1,a}(x_0)$ in (a), $q_{2,a}(x_0)$ in (b), mgf($x_0$) in (c), and $p_{T,a}(x_0)$ in (d). 'Optimal' denotes an 
approximation of the zero-variance IS drift. 

From tables 10.1 and 10.2 we can see that using the IS parameters from the MSM of ic and msq2
led in each case to the lowest estimates of variances and (theoretical) inefficiency constants,
followed by the ones from using the parameters from the MSM of msq, and finally ce. Using
CMC led in each case to the highest such estimates. For $q_{2,a}(x_0)$ and each of ce, msq, and
msq2, using the IS parameter from the sixth stage of MSM led to a lower estimate of variance
and inefficiency constant than using such a parameter from the third stage. Note also that the
estimates of the inefficiency constants and variances for $q_{2,a}(x_0)$ when using the IS parameters
from the third stage of the MSM of msq2 and ic are lower than when using the parameters
from the sixth stage of the MSM of ce and msq (though for the estimates of the inefficiency
constants for msq2 and msq we cannot confirm this at the desired significance level as in
Section 10.1). For $p_{T,a}(x_0)$, the estimate of the inefficiency constant, variance, and mean
cost is respectively lower, higher, and lower when minimizing ic than msq2. Some intuitions
behind these results are given by Theorem 209 and Remark 129, see also the discussion in
Section 10.4.

                    CMC             ce                msq               msq2              ic
Estimates of inefficiency constants ($\cdot 10^{-3}$)
$q_{1,a}$           7204 ± 92       4855 ± 649        507 ± 10          129.1 ± 3.8       132.9 ± 3.5
$q_{2,a}$, k = 3    7300 ± 88       553 ± 26          60.8 ± 1.2        53.4 ± 1.0        50.5 ± 1.1
$q_{2,a}$, k = 6                    73.1 ± 1.0        56.10 ± 0.69      48.78 ± 0.59      47.99 ± 0.61
mgf                 2041 ± 15       6.276 ± 0.055     3.691 ± 0.035     3.177 ± 0.029     3.163 ± 0.028
Estimates of variances ($\cdot 10^{-3}$)
$q_{1,a}$           174.0 ± 1.7     126 ± 14          12.36 ± 0.20      3.155 ± 0.083     3.208 ± 0.073
$q_{2,a}$, k = 3    177.43 ± 1.7    16.35 ± 0.79      1.437 ± 0.021     1.268 ± 0.019     1.223 ± 0.020
$q_{2,a}$, k = 6                    1.794 ± 0.021     1.347 ± 0.013     1.164 ± 0.011     1.169 ± 0.012
mgf                 49.41 ± 0.32    0.9040 ± 0.0072   0.3732 ± 0.0029   0.3195 ± 0.0024   0.3209 ± 0.0024

Table 10.1: Estimates of the inefficiency constants and variances of the estimators of the
translated committors for a = 0.05 and the MGF when using CMC or IS with IS parameters
from the MSM of different estimators. For $q_{2,a}$ we consider using the IS parameters from the
kth stages of MSM for $k \in \{3, 6\}$.

                        CMC              ce                msq               msq2               ic
ic ($\cdot 10^{-3}$)    1391 ± 5         65.22 ± 0.30      63.332 ± 0.09     62.677 ± 0.090     61.863 ± 0.091
var ($\cdot 10^{-3}$)   150.78 ± 0.58    10.258 ± 0.042    10.036 ± 0.013    9.8937 ± 0.0126    9.9491 ± 0.0130
c                       9.227 ± 0.004    6.3608 ± 0.0066   6.3117 ± 0.0021   6.3355 ± 0.0021    6.2173 ± 0.0021

Table 10.2: Estimates of the inefficiency constants and variances of the estimators of $p_{T,a}$ for
a = 0.05 as well as of the mean costs, when using CMC or IS with the IS parameters from the
MSM of different estimators.
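
The tables above report variances, mean costs, and inefficiency constants side by side. As a
reminder of how these quantities are related, here is a minimal Python sketch, not the code
used for the experiments, that estimates a theoretical inefficiency constant as the product of a
sample variance and a sample mean cost. The function name and its arguments are
hypothetical, and the estimators of the inefficiency constant used in this work may be
constructed differently.

```python
from statistics import fmean, variance


def inefficiency_constant(values, costs):
    """Estimate ic = Var(Z) * E(C) from paired samples (hypothetical helper).

    values: outputs Z_i of the (CMC or IS) estimator, one per simulation
    costs:  theoretical costs C_i of the same simulations
    """
    if len(values) != len(costs) or len(values) < 2:
        raise ValueError("need at least two paired samples")
    var_hat = variance(values)   # unbiased sample variance
    cost_hat = fmean(costs)      # sample mean cost
    return var_hat * cost_hat, var_hat, cost_hat


# Consistency check against the CMC column of Table 10.2:
# var = 150.78e-3 and c = 9.227 give a product close to 1391e-3.
ic_cmc = 150.78e-3 * 9.227
```

The last line only multiplies two reported numbers. It illustrates that the CMC entries of
Table 10.2 are consistent with the definition of the inefficiency constant as variance times
mean cost.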

Using the IS parameters v from the MSM of ic as above and averaging the estimates from the
nK simulations available in each case we computed the IS MC estimates of the quantities of
interest: mgf$(x_0)$, and using the translated estimators as in Section 5.7 also of $p_T(x_0)$ and $q_i(x_0)$,
i = 1, 2. The results are presented in Table 10.3. Note that we have $q_1(x_0) = 1 - q_2(x_0) \approx 0.78$
and the estimates of the inefficiency constants in Table 10.1 for estimating the lower value
committor $q_2(x_0)$ are lower. Thus, it seems reasonable to use the translated IS estimator for
$q_2(x_0)$ also for computing $q_1(x_0)$ as discussed in Section 5.7.


$q_1(x_0)$         $q_2(x_0)$            mgf$(x_0)$                         $p_T(x_0)$
0.7751 ± 0.0004    0.22597 ± 0.00025     0.16682 ± ($6 \cdot 10^{-?}$)      0.18396 ± ($7 \cdot 10^{-?}$)

Table 10.3: Estimates of different expectations obtained from IS MC using IS parameters from
the MSM of ic.

In the above experiments utilising nK simulations we also computed the MC estimates of the
mean costs $c(v)$. For comparison we also computed an estimate of the mean cost in CMC
(equal to $c(0) = \mathbb{E}_{\mathbb{U}}(\tau)$), using an MC average of such costs from $7.5 \cdot 10^{?}$ simulations. The
results are provided in Table 10.4. Note that the estimates of the mean costs in tables 10.2 and
10.4 are lower for IS using the IS parameters v from the MSM methods for computing mgf$(x_0)$
and $p_{T,a}(x_0)$ than for the respective CMC methods. As discussed in Section 3.3, an intuition
behind these results is provided by Theorem 16.

       $q_{1,a}(x_0)$    $q_{2,a}(x_0)$    mgf$(x_0)$      CMC
c      41.26 ± 0.28      41.18 ± 0.28      9.89 ± 0.03     41.44 ± 0.15

Table 10.4: Estimates of the mean costs when using the IS parameters from the MSM of ic and
for CMC, for the problems of computing the translated committors for a = 0.05 and MGF.

We also performed two-stage experiments similar to the above for $q_{2,a}(x_0)$ and $p_{T,a}(x_0)$ for several
different added constants $a \in \mathbb{R}_+$ other than a = 0.05 considered above. For $q_{2,a}(x_0)$ we used
the IS basis functions as above, while for $p_{T,a}(x_0)$ also the time-dependent basis functions as
in (5.82) for M = 5 and M = 10 and various $p \in \mathbb{N}_+$. This time in the first stages we performed
the MSM only of ic for k = 3 and $n_i = 400 \cdot 2^{i-1}$, i = 1,..., k, so that the number of samples
$n_k = 1600$ used in the final stages of MSM was the same as for a = 0.05 above. In the second
stages we estimated the inefficiency constants, mean costs, and variances in an external loop
like above. For $q_{2,a}(x_0)$ we made K = 3000 repetitions in such a loop, while for $p_{T,a}(x_0)$ we made
K = 10000 for the basis functions as in (5.81), as well as K = 50000 for the basis functions as in
(5.82) for M = 5 and K = 30000 for M = 10. The results are presented in tables 10.5, 10.6, and
10.7, along with the results for the case of a = 0.05 considered before. The smallest estimates
of the inefficiency constants and variances for $q_{2,a}(x_0)$ were obtained for a = 0.05. For $p_{T,a}(x_0)$
and the basis functions as in (5.81), we obtained the smallest variance for a = 0.2 and the
lowest inefficiency constants for a = 0.1 and a = 0.2. Among all the cases for $p_{T,a}$, the smallest
variances and theoretical inefficiency constants were obtained for a = 0 and when using the
time-dependent IS basis functions (5.82) for M = 10 and p = 3.


a =                     0                0.05             0.1              0.2
ic ($\cdot 10^{-3}$)    82.6 ± 2.3       47.99 ± 0.61     51.89 ± 0.80     55.52 ± 0.86
var ($\cdot 10^{-3}$)   2.008 ± 0.058    1.169 ± 0.012    1.252 ± 0.016    1.359 ± 0.016

Table 10.5: Estimates of the inefficiency constants and variances of the IS estimators of $q_{2,a}$
for different a, corresponding to IS with the parameters from the MSM of ic.


a =                     0                0.05               0.1              0.2              0.3
ic ($\cdot 10^{-3}$)    100.8 ± 0.7      61.86 ± 0.09       55.95 ± 0.35     56.44 ± 0.39     61.43 ± 0.48
var ($\cdot 10^{-3}$)   17.88 ± 0.10     9.949 ± 0.013      8.224 ± 0.047    7.544 ± 0.050    7.794 ± 0.058
c                       5.641 ± 0.009    6.2173 ± 0.0021    6.803 ± 0.009    7.489 ± 0.009    7.874 ± 0.009

Table 10.6: Estimates of the inefficiency constants and variances of the estimators of $p_{T,a}$ for
different a, and estimates of the mean costs, corresponding to IS with the parameters from
the MSM of ic and using M = 10 time-independent IS basis functions as in (5.81).

M     p    a       ic ($\cdot 10^{-3}$)    var ($\cdot 10^{-3}$)    c
5     1    0       63.01 ± 0.39            10.544 ± 0.064           5.978 ± 0.003
5     1    0.05    93.44 ± 0.45            13.60 ± 0.06             6.871 ± 0.004
5     2    0       48.18 ± 0.28            7.975 ± 0.045            6.037 ± 0.003
5     3    0       46.52 ± 0.23            7.629 ± 0.036            6.086 ± 0.003
10    1    0       34.11 ± 0.16            5.714 ± 0.025            5.966 ± 0.004
10    2    0       22.95 ± 0.17            3.899 ± 0.027            5.878 ± 0.004
10    3    0       17.34 ± 0.12            2.951 ± 0.019            5.887 ± 0.004

Table 10.7: Estimates of the inefficiency constants and variances of IS estimators of $p_{T,a}$ for
different a, as well as of the mean costs, corresponding to IS with the parameters from the
MSM of ic, using the time-dependent IS basis functions as in (5.82) for different M and p.


In our experiments, when using CMC and IS MC with different sets of IS basis functions, the
proportionality constants $p_{\dot{C}}$ as in Chapter 2 were considerably different. Thus, to compare
the efficiency of the MC methods using estimators corresponding to these different bases, one
should compare their practical rather than theoretical inefficiency constants. We performed
separate experiments approximating some $p_{\dot{C}}$ as above and computing the corresponding
practical inefficiency constants (equal to the products of such $p_{\dot{C}}$ and the respective theoretical
inefficiency constants). For $n = 10^{?}$, we ran n-step CMC and IS MC procedures for estimating
$q_{2,0.05}(x_0)$ using the IS basis functions as in (5.81), and for estimating $p_{T,a}(x_0)$: for a = 0.05 for
IS basis functions as in (5.81) for M = 10, and for a = 0 for IS basis functions as in (5.82) for
M = 5 and p = 1, and for M = 10 and $p \in \{1, 3\}$. When performing the IS MC we used the IS
parameters from the final stages of the corresponding MSM procedures as above. For $C_i$ being
the theoretical cost of the ith step of a given MC procedure and $\dot{C}_i$ being its practical cost equal
to its computation time calculated using the MATLAB tic and toc functions, as an approximation
of $p_{\dot{C}}$ we used the ratio $p_{\dot{C},n} = \frac{\sum_{i=1}^{n} \dot{C}_i}{\sum_{i=1}^{n} C_i}$. Treating $(\dot{C}_i, C_i)$, i = 1, 2, ..., as i.i.d. random vectors with
square-integrable coordinates, for $p_{\dot{C}} := \frac{E(\dot{C}_1)}{E(C_1)}$, from the delta method it easily follows that for

$$\sigma := p_{\dot{C}} \sqrt{\frac{\mathrm{Var}(\dot{C}_1)}{(E(\dot{C}_1))^2} + \frac{\mathrm{Var}(C_1)}{(E(C_1))^2} - \frac{2\,\mathrm{Cov}(\dot{C}_1, C_1)}{E(\dot{C}_1)E(C_1)}} \qquad (10.8)$$

we have $\sqrt{n}(p_{\dot{C},n} - p_{\dot{C}}) \Rightarrow \mathcal{N}(0, \sigma^2)$. For $\sigma_n$ being an estimate of $\sigma$ in which instead of means,
variances, and covariances one uses their standard unbiased estimators computed using
$(\dot{C}_i, C_i)_{i=1}^{n}$, we have a.s. $\frac{\sigma_n}{\sigma} \to 1$. Thus, from Slutsky's lemma, $\frac{\sqrt{n}}{\sigma_n}(p_{\dot{C},n} - p_{\dot{C}}) \Rightarrow \mathcal{N}(0, 1)$, which
can be used for constructing asymptotic confidence intervals for $p_{\dot{C}}$. In Table 10.8 we provide
the computed estimates in the form $p_{\dot{C},n} \pm \frac{\sigma_n}{\sqrt{n}}$. It can be seen that these approximations of $p_{\dot{C}}$ are
close for $q_{2,a}(x_0)$ and $p_{T,a}(x_0)$ when using in both cases CMC or IS with the basis functions as
in (5.81), and for $p_{T,a}(x_0)$ when using the basis functions as in (5.82) for M = 10 and different
p. However, such $p_{\dot{C}}$ differ significantly for the other pairs of MC methods. In Table 10.8 we
also provide the estimates of practical inefficiency constants ic obtained by multiplying the
corresponding estimates of the theoretical inefficiency constants computed earlier by the
obtained approximations of $p_{\dot{C}}$. From this table we can see that using IS in the considered
cases led to considerable practical inefficiency constant reductions over using CMC.

Quantity     Method             $p_{\dot{C}}$ ($\cdot 10^{-?}$)    ic ($\cdot 10^{-?}$)
$q_{2,a}$    CMC                33.612 ± 0.003                     245.4 ± 3.0
$q_{2,a}$    a = 0.05           137.06 ± 0.01                      6.578 ± 0.084
$p_{T,a}$    CMC                33.998 ± 0.004                     47.3 ± 0.2
$p_{T,a}$    a = 0.05           138.75 ± 0.01                      8.584 ± 0.013
$p_{T,a}$    M = 5, p = 1       96.36 ± 0.01                       6.072 ± 0.038
$p_{T,a}$    M = 10, p = 1      148.92 ± 0.01                      5.080 ± 0.024
$p_{T,a}$    M = 10, p = 3      149.74 ± 0.01                      2.597 ± 0.018

Table 10.8: Estimates of $p_{\dot{C}}$ and ic for computing $q_{2,a}$ and $p_{T,a}$ for various a and IS basis
functions as discussed in the main text.
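
The ratio estimate of the proportionality constant and its delta-method error term can be
computed directly from the paired step costs. The following Python sketch is only an
illustration under the i.i.d. assumption stated above. The function name and argument names
are hypothetical, and the timing of the steps (done with MATLAB tic and toc in our
experiments) is assumed to have happened elsewhere.

```python
from math import sqrt
from statistics import fmean, variance


def cost_ratio_with_se(practical, theoretical):
    """Ratio-of-means estimate of the proportionality constant together with
    its delta-method standard error sigma_n / sqrt(n) (hypothetical helper)."""
    n = len(practical)
    if n != len(theoretical) or n < 2:
        raise ValueError("need at least two paired samples")
    m_p, m_t = fmean(practical), fmean(theoretical)
    ratio = m_p / m_t  # equals sum(practical) / sum(theoretical)
    # Unbiased sample covariance of the paired costs.
    cov = sum((p - m_p) * (t - m_t)
              for p, t in zip(practical, theoretical)) / (n - 1)
    # Sample analogue of the expression under the square root in (10.8).
    rel_var = (variance(practical) / m_p ** 2
               + variance(theoretical) / m_t ** 2
               - 2.0 * cov / (m_p * m_t))
    sigma_n = ratio * sqrt(max(rel_var, 0.0))  # guard against round-off
    return ratio, sigma_n / sqrt(n)
```

A practical inefficiency constant is then the product of such a ratio and a theoretical
inefficiency constant. For example, for CMC and $q_{2,a}$ the entries 33.612 from Table 10.8 and
7300 (in units of $10^{-3}$) from Table 10.1 give 33.612 · 7.300 ≈ 245.4 in the units of Table 10.8.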

10.3 Experiments comparing the spread of IS drifts. 

In the experiments described in this section we consider the assumptions as for the estimation
of $q_{2,a}(x_0)$ in Section 5.9. For the estimators $\widehat{\mathrm{est}}$ equal to each of ce, msq, msq2, and ic, we
performed 20 independent SSM experiments for $n_1 = 100$ and $b' = 0$ as in Section 7.1, i.e.
minimizing $b \to \widehat{\mathrm{est}}_{n_1}(b', b)(\chi_1)$ for some $\chi_1$ as in that section. For each such experiment for
$\widehat{\mathrm{est}}$ = ic, for the same $\chi_1$ as in that experiment, we additionally carried out a two-phase
minimization, in its first phase minimizing $b \to \mathrm{msq}_{n_1}(b', b)(\chi_1)$ and in the second
$b \to \mathrm{ic}_{n_1}(b', b)(\chi_1)$, using the first-phase minimization result as a starting point. The IS drifts
corresponding to the IS parameters computed in the above experiments are shown in Figure 10.3,
in which we also show an approximation of the zero-variance IS drift $r^*$ for the corresponding
diffusion problem as in the previous section.
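
The two-phase minimization amounts to warm-starting the minimization of one function at
the minimizer of another. The following Python sketch illustrates only this mechanism on two
hypothetical one-dimensional functions; they are not the estimators msq or ic, and the
fixed-step gradient descent is an assumed stand-in for the minimization algorithm actually
used. The second function has two local minima, and which of them is reached depends on the
starting point.

```python
def gradient_descent(grad, start, step=0.01, iters=5000):
    """Plain fixed-step gradient descent in one dimension."""
    b = start
    for _ in range(iters):
        b -= step * grad(b)
    return b


# Hypothetical stand-ins for the two minimized functions:
# phase 1 is (b - 1.2)^2, convex with its minimum at b = 1.2;
# phase 2 is ((b - 1.5) * (b + 1))^2, with local minima at b = -1 and
# b = 1.5 separated by a local maximum at b = 0.25.
def grad_phase1(b):
    return 2.0 * (b - 1.2)


def grad_phase2(b):
    g = (b - 1.5) * (b + 1.0)
    return 2.0 * g * (2.0 * b - 0.5)


# Single-phase minimization started at zero ends in the minimum at -1.
single_phase = gradient_descent(grad_phase2, start=0.0)

# Two-phase minimization: minimize the first function, then use its
# minimizer as the starting point for the second one; it ends at 1.5.
warm_start = gradient_descent(grad_phase1, start=0.0)
two_phase = gradient_descent(grad_phase2, start=warm_start)
```

The toy functions are chosen so that the two runs end in different local minima. They say
nothing about the values of the minimized estimators of this work at such points.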

From Figure 10.3 (d) it can be seen that ordinary (i.e. single-phase) SSM using ic yielded
in three experiments IS drifts far from $r^*$, while in the other 17 experiments and in all 20
experiments when using two-phase minimization we obtained drifts close to $r^*$. In the
experiments in which ordinary minimization led to drifts far from $r^*$, the value of $\mathrm{ic}_{n_1}(b', b)(\chi_1)$ in
the minimization result b was several times smaller than when using two-phase minimization,
while in the other cases these values were very close (e.g. the absolute value of the difference of
such values divided by the smaller of them was in each case below 1%). In Figure 10.3, the IS
drifts from the SSM of msq2 and the two-phase minimization minimizing ic in the second
phase as above seem to be the least spread, followed by the ones from the minimization of
msq, and finally ce.

From the below remark it can be expected that for a sufficiently large n, the IS drifts from SSM
experiments like above should have an approximately normal distribution at each point.

Remark 285. Let A, T, $d_t$, and $r_t$ be as in Section 8.1 and let for some covariance matrix
$D \in \mathbb{R}^{l \times l}$ it hold

$$\sqrt{r_t}\,(d_t - b^*) \Rightarrow \mathcal{N}(0, D). \qquad (10.9)$$

This holds e.g. for $d_t$ being the SSM or MSM results of the estimators $\widehat{g}$ for g replaced by ce, msq,
msq2, or ic, for the SSM under the assumptions as in Section 8.5 for $D = u(b') V_g(b')$ and $r_t = t$,
$t \in T$, while for the MSM under the assumptions as in Section 8.7 for $D = V_g(d^*)$, $T = \mathbb{N}_+$, and
$r_k = n_k$, $k \in T$. Let us assume a linear parametrization of the IS drifts as in (5.58), let $x \in \mathbb{R}^m$,
and let $B \in \mathbb{R}^{l \times d}$ be such that $B_{i,j} = (\tilde{r}_i)_j(x)$, i = 1,..., l, j = 1,..., d. Then, $r(d_t)(x) = B^T d_t$, $t \in T$,
and from (10.9)

$$\sqrt{r_t}\,(r(d_t) - r(b^*))(x) \Rightarrow \mathcal{N}(0, B^T D B). \qquad (10.10)$$

Figure 10.3: The IS drifts from SSM for estimating $q_{2,a}$, minimizing ce in (a), msq in (b),
msq2 in (c), and ic in (d). The "optimal" IS drift is a finite-difference approximation of the
zero-variance IS drift $r^*$. The label 'msqic' in (d) corresponds to the two-phase minimization
and 'ic' to the single-phase minimization as discussed in the main text.
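
To make the covariance in (10.10) of Remark 285 concrete, the following Python sketch
evaluates a linearly parametrized drift at a point and the matrix $B^T D B$ for small nested
lists. The helper names and the numbers in the example are hypothetical and are not taken
from our experiments.

```python
def drift_at_point(B, b):
    """Return r(b)(x) = B^T b for a linear parametrization of the drift.

    B is an l x d nested list of basis-function values at the point x,
    b is a list of l parameters.
    """
    l, d = len(B), len(B[0])
    return [sum(B[i][j] * b[i] for i in range(l)) for j in range(d)]


def drift_asymptotic_covariance(B, D):
    """Return B^T D B for B an l x d nested list and D an l x l nested list."""
    l, d = len(B), len(B[0])
    # DB = D B, an l x d matrix.
    DB = [[sum(D[i][k] * B[k][j] for k in range(l)) for j in range(d)]
          for i in range(l)]
    # B^T (D B), a d x d matrix.
    return [[sum(B[k][i] * DB[k][j] for k in range(l)) for j in range(d)]
            for i in range(d)]


# Hypothetical example: l = 2 basis functions and a d = 1 dimensional drift.
B = [[1.0], [0.5]]             # values of the two basis functions at x
D = [[4.0, 1.0], [1.0, 2.0]]   # asymptotic covariance of the parameters
cov = drift_asymptotic_covariance(B, D)    # [[5.5]]
drift = drift_at_point(B, [2.0, 4.0])      # [4.0]
```

For d = 1 the result is a single number, which is the asymptotic variance of the drift at the
point x.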


In the further experiments, to be able to carry out more simulations in a reasonable time,
we changed the model considered by increasing the temperature 10 times. For such a new
temperature we obtained an estimate 1.468 ± 0.002 of the mean cost $\mathbb{E}_{\mathbb{U}}(\tau)$ in CMC, as
compared to 41.44 ± 0.15 under the original temperature as in Table 10.4. We carried out an MSM
procedure of msq2 for k = 6 and $n_i = 50 \cdot 2^{i-1}$, i = 1,..., k, receiving the final minimization
result $b_{MSM}$. For $\widehat{\mathrm{est}}$ equal to each of the different estimators as above and for $b' = 0$ and
$b' = b_{MSM}$, we carried out independently N = 5000 times the SSM of $\widehat{\mathrm{est}}$ for $n_1 = 200$, in the ith
SSM receiving a result $a_i$ and then computing $r(a_i)(0)$, i.e. the corresponding IS drift at zero,
i = 1,..., N. The histograms of such IS drifts at zero for $b' = 0$, with fitted Gaussian functions,
are shown in Figure 10.4. This figure suggests that the distributions of the IS drifts at zero are
approximately normal, as could be expected from Remark 285. Furthermore, the (empirical)
distribution of the IS drifts at zero for $b' = 0$ seems to be in a sense the least spread when
minimizing msq2 and ic, followed by msq, and finally ce. The same observations can be made
from the inspection of histograms for the case of $b' = b_{MSM}$, which are not shown. We shall
now compare quantitatively the spread of empirical distributions of the IS drifts at zero in the
above experiments for the different estimators and $b'$ used, using interquartile ranges (IQRs),
the definition and some required properties of which are provided in the below remark.

Figure 10.4: Histograms of the IS drifts at zero from the SSM for computing $q_{2,a}$, minimizing
ce in (a), msq in (b), msq2 in (c), and ic in (d).

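Remark 286. For i.i.d. random variables $X_1, X_2, \ldots$, and $k, n \in \mathbb{N}_+$, let $X_{k:n}$ be the kth coordinate
of $\tilde{X}_n := (X_i)_{i=1}^{n}$ in the nondecreasing order. For n > 4, let us define the interquartile range (IQR)
of the coordinates of $\tilde{X}_n$ as $\mathrm{IQR}_n = X_{\lfloor 3n/4 \rfloor : n} - X_{\lfloor n/4 \rfloor : n}$. Let further $X_1 \sim \mathcal{N}(\mu, \sigma^2)$ for some
$\mu \in \mathbb{R}$ and $\sigma \in \mathbb{R}_+$, and let q denote the IQR of $\mathcal{N}(\mu, \sigma^2)$ (i.e. the difference of its third and first
quartile). Then, for a certain $d \approx 1.36$ we have $\sqrt{n}(\mathrm{IQR}_n - q) \Rightarrow \mathcal{N}(0, d q^2)$ (see page 327 in [15]).
Thus, for $\sigma_n = \sqrt{d}\,\mathrm{IQR}_n$ we have $\frac{\sqrt{n}}{\sigma_n}(\mathrm{IQR}_n - q) \Rightarrow \mathcal{N}(0, 1)$, which can be used for constructing
asymptotic confidence intervals for q. For some $\mu' \in \mathbb{R}$ and $\sigma' \in \mathbb{R}_+$, consider further $X'_1, X'_2, \ldots$,
i.i.d. $\sim \mathcal{N}(\mu', (\sigma')^2)$, such that $(X'_i)_{i \in \mathbb{N}_+}$ is independent of $(X_i)_{i \in \mathbb{N}_+}$, and let $q' \in \mathbb{R}_+$ be the
IQR of $\mathcal{N}(\mu', (\sigma')^2)$. Then, for $\mathrm{IQR}'_n$ analogous as above but for the primed variables we have
$\sqrt{n}(\mathrm{IQR}_n - q, \mathrm{IQR}'_n - q') \Rightarrow \mathcal{N}(0, d\,\mathrm{diag}(q^2, (q')^2))$. Thus, for $R = \frac{q}{q'}$, $R_n = \frac{\mathrm{IQR}_n}{\mathrm{IQR}'_n}$, and $\sigma_R = R\sqrt{2d}$,
from the delta method we have $\sqrt{n}(R_n - R) \Rightarrow \mathcal{N}(0, \sigma_R^2)$. Therefore, for $\sigma_{R,n} = R_n\sqrt{2d}$ we have
$\frac{\sqrt{n}}{\sigma_{R,n}}(R_n - R) \Rightarrow \mathcal{N}(0, 1)$.

For $X_i = r(a_i)(0)$, i = 1,..., N, obtained from the SSM of different estimators as above, we
computed the estimates $\mathrm{IQR}_N$ of the IQRs of drifts at zero and the values $\frac{\sigma_N}{\sqrt{N}}$ as in Remark 286.
The results are provided in Table 10.9.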




                  ce               msq                 msq2                ic
$b' = 0$          0.736 ± 0.012    0.3770 ± 0.0062     0.1039 ± 0.0017     0.1006 ± 0.0017
$b' = b_{MSM}$    0.546 ± 0.009    0.2910 ± 0.0048     0.0932 ± 0.0015     0.0929 ± 0.0015

Table 10.9: Estimates of the IQRs of the IS drifts at zero and the values $\frac{\sigma_N}{\sqrt{N}}$ as in Remark 286,
from the SSM of various estimators.

From this table we can see that for both values of $b'$ the computed estimates of IQRs
from the minimization of ic and msq2 are the lowest, followed by such estimates from the
minimization of msq, and finally ce. The ratio of the estimates of IQRs from the minimization
of ce to msq is 1.951 ± 0.045 for $b' = 0$ and 1.8771 ± 0.044 for $b' = b_{MSM}$ (where the results
are provided in the form $R_n \pm \frac{\sigma_{R,n}}{\sqrt{n}}$ under appropriate identifications with the variables from
Remark 286). Note that these ratios are close to 2. Intuitions supporting the above results are
given in Section 10.4. Note also that the estimates of IQRs are lower when using $b' = b_{MSM}$
than $b' = 0$.
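
The error terms in Table 10.9 and in the above ratios can be recomputed from Remark 286.
The Python sketch below is an illustration with hypothetical function names. It evaluates the
constant d from the standard asymptotic covariance of the sample quartiles of a normal
sample, which is an assumption made here rather than a formula quoted from [15], and which
gives $d = 1/(16 z^2 \varphi(z)^2) \approx 1.36$ for z the third quartile of the standard normal
distribution and $\varphi$ its density.

```python
from math import sqrt
from statistics import NormalDist

# Constant d of Remark 286, computed as 1 / (16 * z^2 * pdf(z)^2) with z the
# 0.75 quantile of the standard normal distribution (about 1.36).
_Z = NormalDist().inv_cdf(0.75)
D_IQR = 1.0 / (16.0 * _Z ** 2 * NormalDist().pdf(_Z) ** 2)


def iqr_with_se(samples):
    """Order-statistic IQR and the error term sqrt(d) * IQR_n / sqrt(n)."""
    xs = sorted(samples)
    n = len(xs)
    if n < 4:
        raise ValueError("need at least four samples")
    # X_{floor(3n/4):n} - X_{floor(n/4):n} with 1-based order statistics.
    iqr = xs[(3 * n) // 4 - 1] - xs[n // 4 - 1]
    return iqr, sqrt(D_IQR) * iqr / sqrt(n)


def iqr_ratio_with_se(iqr_num, iqr_den, n):
    """Ratio R_n of two independent IQR estimates, each computed from n
    samples, and the error term R_n * sqrt(2 d) / sqrt(n)."""
    ratio = iqr_num / iqr_den
    return ratio, ratio * sqrt(2.0 * D_IQR) / sqrt(n)


# Ratio of the ce and msq entries of Table 10.9 for b' = 0 and N = 5000.
ratio, ratio_se = iqr_ratio_with_se(0.736, 0.3770, 5000)
```

With the rounded table entries this gives a ratio of about 1.95 with an error term of about
0.046, in line with the value 1.951 ± 0.045 reported above.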

10.4 Some intuitions behind certain results of our numerical experiments

Recall that in the numerical experiments in Section 10.2 we observed the fastest convergence
of the IS drifts in the MSM results and in Section 10.3 the lowest spreads of such drifts in the
SSM results when minimizing msq2 and ic, followed by msq, and finally ce. Furthermore, in
Section 10.3 the IQRs of the values at zero of the IS drifts corresponding to the SSM results were
approximately two times higher when minimizing ce than msq. In this section we provide
some intuitions behind these and some other of our experimental results. We will need the
following remark.

Remark 287. Let us assume that, similarly as in Section 8.9, for $b^*$ being a zero-variance IS
parameter, for each b we have $V_{msq2}(b) = V_{ic}(b) = 0$, $V_{ce}(b) = 4 V_{msq}(b)$, and $V_{ce}(b)$ is
positive definite. Let further, similarly as in Remark 285, for d = 1, for g replaced by each
of ce, msq, msq2, and ic, for $x \in \mathbb{R}^m$ and $B = (\tilde{r}_i(x))_{i=1}^{l}$, for $u(b') \in \mathbb{R}_+$, $r_t = t$, and
$v_g = u(b') B^T V_g(b') B$ for SSM or $r_k = n_k$ and $v_g = B^T V_g(d^*) B$ for MSM, the IS drifts corresponding
to the SSM or MSM results $d_t$ of the estimators $\widehat{g}$ respectively fulfill

$$\sqrt{r_t}\,(r(d_t) - r(b^*))(x) \Rightarrow \mathcal{N}(0, v_g). \qquad (10.11)$$

Then, for g replaced by msq2 or ic we have $v_g = 0$ and the distribution $\mathcal{N}(0, v_g)$ has zero IQR.
If further $\tilde{r}_i(x) \neq 0$ for some i, then $0 < v_{ce} = 4 v_{msq}$, so that $\mathcal{N}(0, v_{ce})$ has a positive
IQR, which is exactly two times higher than the IQR of $\mathcal{N}(0, v_{msq})$.
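
The factor of two in Remark 287 can be checked directly. Writing $v_{ce}$ and $v_{msq}$ for the
asymptotic variances in (10.11) and $z_{0.75} \approx 0.674$ for the third quartile of the standard normal
distribution (a symbol introduced only for this check), the quartiles of $\mathcal{N}(0, v)$ for $v > 0$ are
$\pm z_{0.75}\sqrt{v}$, so that

$$\mathrm{IQR}(\mathcal{N}(0, v)) = 2 z_{0.75} \sqrt{v}, \qquad \frac{\mathrm{IQR}(\mathcal{N}(0, v_{ce}))}{\mathrm{IQR}(\mathcal{N}(0, v_{msq}))} = \sqrt{\frac{v_{ce}}{v_{msq}}} = \sqrt{4} = 2.$$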

A possible reason why we obtained the above mentioned experimental results is that we can
have approximately the same relations as in the above remark between the matrices $V_g(b)$ for
the appropriate b in our experiments, that is the entries of $V_g(b)$ can be much smaller for g
equal to msq2 and ic than msq and ce, and we can have $V_{ce}(b) \approx 4 V_{msq}(b)$. This would lead to
approximately the same relations between the asymptotic variances of the IS drifts at different
points and the IQRs of their asymptotic distributions as in the above remark.

Such approximate relations between the matrices Vg{b) can be a consequence of the IS 
distributions and densities corresponding to the minimum points of the minimized functions 
being close to the zero-variance ones, in the sense that the derivations as in Section 8.9 can be
carried out approximately. For the estimation problems for whose diffusion counterparts there 
exist zero-variance IS drifts, like for the case of the translated committors and MGF, we also 
have the following possible intuition behind the hypothesized approximate relations between 
the matrices Vg{b) as above. For the diffusion counterparts of these estimation problems, the 
zero-variance IS drifts minimize the mean square, inefficiency constant, and cross-entropy 
among all the appropriate drifts. Furthermore, as evidenced in Figure 10.2, the diffusion 
zero-variance IS drifts can be approximated very well using linear combinations of the IS basis
functions considered. Thus, the diffusion IS drifts corresponding to the minimizers of the 
functions considered are likely to be close to the zero-variance ones. Therefore, using such 
drifts in the place of the zero-variance ones, the derivations as in Section 8.9 can be carried out 
approximately and we should have approximately the same relations between the matrices 
Vg{b) for the diffusion case as in Remark 287. For small stepsizes h, like the ones used in our 
numerical experiments, the matrices Vg(b) for the Euler scheme case can be expected to be 
close to their diffusion counterparts and thus we should also have approximately the same 
relations between them as above. 

For small stepsizes we can also expect the IS drifts corresponding to the minimizers of the 
functions considered for the Euler scheme case to be close to their diffusion counterparts, 
and thus, from the above discussion, also close to the diffusion zero-variance IS drifts. This 
would provide an intuition why in Figure 10.2 the IS drifts from the minimization of various 
estimators of the functions considered are close to the approximations of the zero-variance IS 
drifts for the diffusion case. 

In the experiments from Section 10.2 for computing $p_{T,a}(x_0)$, the MSM results of ic led to a
lower estimate of the inefficiency constant than those of msq2, at the same time yielding a
higher estimate of the variance and a lower one of the mean cost. A possible intuition behind these
results is provided by Theorem 209, from which it follows that under appropriate assumptions
a.s. we eventually should have such relations for the corresponding functions evaluated
on some parameters converging a.s. to the minimum point of the mean square and the
ones minimizing the inefficiency constant (see Section 7.9 for some sufficient assumptions).
Note, however, that this intuition fails when comparing the estimates of the variances in
the minimization results of msq and ic, as the latter were smaller in all of our estimation
experiments. A possible factor that could have contributed to the fact that in Section 10.2 we
obtained the lowest estimates of the inefficiency constants and variances when minimizing
the new estimators ic and msq2, followed by msq, and ce, is that, from the above hypothesis
on the approximate relations of the matrices $V_g(b)$, we may have the lowest spread of the
distributions of the minimization results of the new estimators around the minimum points
of variances and inefficiency constants, followed by such results for msq, and ce. We suspect
that if sufficiently long minimization methods are performed (i.e. for a sufficiently large $n_1$
for SSM or k for MSM), so that the distributions of the minimization results of the estimators
considered become much less spread around the minimum points of their corresponding
functions, then, as suggested by Theorem 209, the minimization results of msq should typically
lead to lower variance than those of ic. However, if the above hypothesis on the entries of
$V_{msq2}(b)$ being much smaller than those of $V_{msq}(b)$ is correct, then, for a longer minimization,
the minimization results of msq2 should still typically lead to lower variance than those of
msq. This is because such results $d_t$ for msq2 would be asymptotically much more efficient
for the minimization of variance in the different second-order senses discussed in Section 8.3.
For instance, in the sense of the mean of the asymptotic distribution of $r_t(\mathrm{msq}(d_t) - \mathrm{msq}(b^*))$
(for the appropriate $r_t$), equal to $\frac{u(b')}{2}\mathrm{Tr}(V_g(b') H_{msq})$ for SSM or $\frac{1}{2}\mathrm{Tr}(V_g(d^*) H_{msq})$ for MSM,
being much smaller for g equal to msq2 than msq. Apart from the highest spread of the
distributions of the minimization results when minimizing ce, another factor that could have
contributed to the higher estimates of the variances in the minimization results of ce than
for the mean square estimators in our experiments is that the minimum points of the
cross-entropy functions are likely to be different from the ones of the mean square functions, so
that, as discussed in Section 7.10, in such cases minimizing the mean square estimators can
be more efficient for the minimization of variance in the first-order sense.

Conclusions and further ideas


In this work we developed methods for obtaining the parameters of the IS change of measure 
adaptively via single- and multi-stage minimization of well-known estimators of cross-entropy 
and mean square, as well as of new estimators of mean square and inefficiency constant, 
ensuring their various convergence and asymptotic properties in the ECM and LETGS settings. 
It would be interesting to prove such properties of our methods, or of some modifications of them,
using some other parametrizations of IS; see e.g. [37, 48] for some examples. 

We proposed criteria for comparing the first- and second-order asymptotic efficiency of 
certain stochastic optimization methods of functions, which for such functions being equal to 
inefficiency constants can be used for comparing the efficiency of methods for finding the 
adaptive parameters in the first stage of a two-stage estimation method as in Chapter 9. We 
also derived formulas for measures of the second-order asymptotic inefficiency of the above 
minimization methods of estimators. 

Let us now discuss some problems which one can face when trying to use in practice the 
minimization methods for the results of which we proved strong convergence and asymptotic 
properties, as well as possible solutions to these problems. When using gradient-based 
stopping criteria in some of these methods, one has to choose some nonnegative random 
bounds Ci or c; on the norms of the gradients in the minimization results, converging to zero 
a.s. (or, equivalently, ensure that these gradients converge to zero a.s.). If chosen too large, 
such bounds can make the minimization algorithm perform in practice no steps at all, and if 
taken too small, they can make the algorithm run longer than it can be afforded. To ensure 
the a.s. convergence of the gradients to zero in the MSM methods and that a reasonable 
computational effort is made by the minimization algorithm in each stage, for some $q \in (1, \infty)$,
one can perform at least a fixed number of steps of the minimization algorithm plus an 
additional number of steps needed to make the norm of the gradient at least q times smaller 
than in the most recent step in which the final gradient was nonzero (assuming that such a 
step exists). 
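
The stopping rule described above can be sketched as follows. This is only an illustration in
Python with a hypothetical interface: `step` and `grad_norm` stand for one iteration of the
minimization algorithm and for the norm of the gradient of the minimized estimator, and the
cap `max_steps` is an extra safeguard that is not part of the rule itself.

```python
def run_stage(step, grad_norm, b, min_steps, q, ref_norm, max_steps=10_000):
    """One MSM stage with the gradient-based stopping rule (hypothetical API).

    step(b)      -> next iterate of the minimization algorithm
    grad_norm(b) -> norm of the gradient of the minimized estimator at b
    ref_norm     -> final gradient norm from the most recent stage in which
                    it was nonzero, or None if there was no such stage
    """
    if q <= 1.0:
        raise ValueError("q must lie in (1, infinity)")
    for _ in range(min_steps):      # always perform the fixed number of steps
        b = step(b)
    steps = min_steps
    if ref_norm is not None:
        # Additional steps until the norm is at least q times smaller.
        while grad_norm(b) * q > ref_norm and steps < max_steps:
            b = step(b)
            steps += 1
    final = grad_norm(b)
    # Keep the old reference if the final gradient happens to be zero.
    new_ref = final if final > 0.0 else ref_norm
    return b, new_ref, steps
```

Calling `run_stage` once per stage and passing the returned reference norm to the next call
makes the final gradient norms decrease at least geometrically with ratio 1/q, as long as the
cap `max_steps` is not reached.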

As discussed in Remark 266, under appropriate assumptions, to ensure that $b_i \to b^*$ in the
MSM methods one can choose appropriate sets $K_i$ containing the variables $b_i$ and such that
$b_i$ is equal to the ith minimization result $d_i$ whenever $d_i \in K_i$. If for some $m \in \mathbb{N}_+$ the sets $K_i$
contain $b^*$ only for i > m, then the convergence of $b_i$ to $b^*$ may be very slow until i exceeds
such an m. One can try to deal with this problem by performing some preliminary SSM or
MSM until the sequence of the minimization results has approximately converged to $b^*$ and
then taking in a new MSM all the sets $K_i$ containing some neighbourhood of the computed
approximation of $b^*$.

As discussed in Section 8.7, as an alternative to using in MSM methods variables $b_i$ converging
to $b^*$ minimizing the function f considered, it may be reasonable to choose such $b_i$ converging
to some $d^*$ minimizing some measure of the second-order asymptotic inefficiency of $d_k$ for
the minimization of f, assuming that such a $d^*$ exists. Such variables $b_i$ could be potentially
obtained by minimizing some estimators of such a measure. A similar idea is to use as the
parameter $b'$ in SSM methods an estimate of $d^*$ minimizing the measure of inefficiency (8.26).

For IS in which the mean theoretical cost is not constant as a function of the IS parameter,
minimizing the inefficiency constant estimator can be asymptotically the best option as 
under appropriate assumptions it can outperform the minimization of the other estimators 
in terms of the first-order asymptotic efficiency for the minimization of the inefficiency 
constant (see Section 7.10). However, if the mean cost does not depend on the IS parameter, 
so that the inefficiency constant is proportional to the variance, then the minimization results 
of all the mean square and inefficiency constant estimators considered can converge to 
the minimum point of variance, in which case minimizing them is asymptotically equally 
efficient for the minimization of variance in the first-order sense. In such a case it may be 
reasonable to minimize the estimators whose minimization results are the most efficient 
for the minimization (e.g. using SSM or MSM) of the variance in the second-order sense, 
as discussed in Section 8.3. A possible idea is to estimate the measures of the second-order 
asymptotic inefficiency of different estimators for the minimization of variance, which can be 
combined with the estimation of the parameters d* minimizing such measures as discussed 
above. The estimators, and potentially also the estimate of d* as above, leading to the lowest 
estimates of the inefficiency measure, can be later used in a separate SSM or MSM procedure. 
In our numerical experiments, using different IS basis functions and added constants a led 
to considerably different inefficiency constants of the adaptive IS estimators. It would be 
interesting to develop adaptive methods for choosing such basis functions and constants. For 
instance, the added constant a can be chosen adaptively via minimization of the estimators of 
variance or inefficiency constant in which such an a is treated as an additional minimization 
parameter. 

In MSM, an alternative approach to the minimization of the estimators constructed using 
only the samples from the last stage, as in this work, would be to minimize some weighted 
average of such estimators from all the previous stages. In our initial numerical experiments, 
minimizing such averages typically yielded drifts farther from the approximations of the zero- 
variance IS drifts for the corresponding diffusions than the approach from this work (data not 
shown), which is why we focused on the current approach. Similarly, the mean $\alpha$ of interest
could be estimated using a weighted average of the estimators from all the stages, which
closely resembles the purely adaptive approach used in stochastic approximation methods
[32, 4, 37, 35]. For instance, under the assumptions as in Section 7.1 and denoting $s_k = \sum_{i=1}^{k} n_i$,
such an estimator of $\alpha$ from the kth stage could be $\frac{1}{s_k}\sum_{i=1}^{k}\sum_{j=1}^{n_i}(ZL(b_{i-1}))(\beta_{i,j})$. An SLLN and

CLT for such an estimator can be proved similarly as in [35].

The model which we used for the numerical experiments in this work is only a toy one. It
would be interesting to test and compare the performance of our minimization methods of
different estimators on some realistic molecular models, as well as on models arising in some
other application areas of IS, like computational finance and queueing theory. When
using our methods for rare event simulation in practice one should take care to choose the
IS parameter b equal to $b'$ in SSM or $b_0$ in MSM so that the considered event is not too rare
under the IS distribution $\mathbb{Q}(b)$. This is because if such an event were too rare, then it would
typically not occur at all in a reasonable simulation time. To find such a b adaptively one can
use e.g. some MSM method in which the problem is modified in the initial stages to make the
considered event less rare in these stages as in [47, 48, 56].